Let's consider this small function
let f x =
  match x with
  | 0 -> 1
  | _ -> x ;;
This is logically equivalent to
let f x =
  if x = 0 then 1 else x ;;
What, then, is the purpose of pattern matching if we can achieve the same thing using if/else?
In your particular example, pattern matching does not bring much, because you have only 2 cases and, more importantly, because your patterns do not bind any variables. Just try writing this example with if/then/else and you will understand:
let rec map f = function
  | [] -> []
  | a :: l -> let r = f a in r :: map f l
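For comparison, here is roughly what the same function looks like without pattern matching, taking the list apart by hand with List.hd / List.tl (a sketch, not from the original answer):
(* Without pattern matching the structure of the list is no longer
   visible in the code; it has to be decomposed with accessors. *)
let rec map f l =
  if l = [] then []
  else
    let a = List.hd l in
    let r = f a in
    r :: map f (List.tl l)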
Note also that pattern matching warns you if you have redundant cases or if you forgot some cases.
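For instance, a clause placed after a catch-all pattern triggers the unused-case warning (a minimal sketch; the exact warning text may vary with the compiler version):
let g = function
  | _ -> 0
  | 1 -> 1 ;;
Warning 11: this match case is unused.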
Usually pattern matching allows a compiler to apply more aggressive optimization techniques. In an if/then/else expression the condition is an arbitrary expression that can contain side effects. For example, the equality operator may do anything, so the compiler cannot, in general, assume that x = 0 means that x is equal to zero. In pattern matching, the patterns are always constants, and matching means syntactic equality, which cannot be overloaded, so it can easily be compiled directly to an assembly comparison instruction. In the example with if, the comparison will in general be compiled to a function call (but AFAIK in this case the compiler is clever enough, and the generated code would be the same).
But the main difference between if/then/else and pattern matching is that the clauses of the latter are matched in parallel and compile into binary search trees embedded in the assembly, whereas if/then/else is just a linear sequence of comparisons (see this for more information).
Update
To satisfy the OP's curiosity I added some assembly output. It is not required to understand x86 assembly; one can just compare the number of instructions to get a basic idea. You will see.
As I predicted, the compiler indeed emitted nearly the same code, with the same performance, for your example:
The function with_match compiled into efficient code (note that the OCaml integer 0 is represented as the machine word 1, because OCaml tags its integers):
with_match:
.L101:
cmpq $1, %rax
je .L100
ret
.L100:
movq $3, %rax
ret
For the function with_if the compiler also emitted optimal code. The only difference is that in the with_if function the condition of the jump instruction is inverted.
with_if:
.L103:
cmpq $1, %rax
jne .L102
movq $3, %rax
ret
.L102:
ret
This was possible because the compiler uses a trick that allows it to treat = as a special function, with some theory attached to it. But in general this is not possible, as = can be an arbitrary function. We can easily confuse the compiler by adding the following line to the start of the file:
let (=) x y = x = y
Now all the tricks are disabled, and the compiler emits this inefficient code:
with_if:
subq $8, %rsp
.L105:
movq %rax, 0(%rsp)
movq $1, %rsi
movq %rax, %rdi
movq _caml_equal, %rax
call _caml_c_call
.L106:
movq _caml_young_ptr, %r11
movq (%r11), %r15
cmpq $1, %rax
je .L104
movq $3, %rax
addq $8, %rsp
ret
.L104:
movq 0(%rsp), %rax
addq $8, %rsp
ret
With all that said, I would like to stress that one shouldn't prefer match over if or vice versa. The construct that is cleaner and results in more readable code should be chosen. The OCaml compiler is rather good and will emit efficient code for you.
I personally lean more toward matches, because this reflects my way of thinking. It is harder for me to reason in terms of if/then/else constructs, and whenever I read them, I mentally translate them into match clauses. But this is my personal quirk. Feel free to use whichever construct suits you better.
A partial pattern matching is detected:
type number = Zero | One | Two ;;
let f = function
  | Zero -> 0
  | One -> 1 ;;
Warning 8: this pattern-matching is not exhaustive.
Here is an example of a value that is not matched:
Two
val f : number -> int = <fun>
Following the approach of this answer, I am trying to understand what exactly happens and how expressions and generated functions work in Julia in the context of metaprogramming.
The goal is to optimize a recursive function using expressions and generated functions (for a concrete example you can have a look at the question answered in the link provided above).
Consider the following modified fibonacci function, in which I want to compute the fibonacci series up to n and multiply it by a number p.
The straightforward, recursive implementation would be
function fib(n::Integer, p::Real)
    if n <= 1
        return 1 * p
    else
        return n * fib(n-1, p)
    end
end
As a first step, I could define a function which returns an expression instead of the computed value
function fib_expr(n::Integer, p::Symbol)
    if n <= 1
        return :(1 * $p)
    else
        return :($n * $(fib_expr(n-1, p)))
    end
end
which, e.g. returns something like
julia> ex = fib_expr(3, :myp)
:(3 * (2 * (1myp)))
In this way I get an expression which is fully expanded and depends on the value assigned to the symbol myp. This way I do not see the recursion anymore; basically, I am metaprogramming: I created a function that creates another "function" (in this case we call it an expression, though).
I can now set myp = 0.5 and call eval(ex) to compute the result.
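For instance, at the REPL (the value 3.0 follows from the expression built above):
julia> myp = 0.5
0.5
julia> eval(ex)
3.0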
However, this is slower than the first approach.
What I can do though, is to generate a parametric function in the following way
@generated function fib_gen{n}(::Type{Val{n}}, p::Real)
    return fib_expr(n, :p)
end
And magically, calling fib_gen(Val{3}, 0.5) gets things done, and is incredibly fast.
So, what is going on?
To my understanding, in the first call to fib_gen(Val{3}, 0.5), the parametric function fib_gen{Val{3}}(...) gets compiled and its content is the fully expanded expression obtained through fib_expr(3, :p), i.e. 3*2*1*p with p substituted with the input value.
The reason why it is so fast, then, is that fib_gen is basically just a series of multiplications, whereas the original fib has to allocate a stack frame for every single recursive call, making it slower -- am I correct?
To give some numbers, here is my short benchmark using BenchmarkTools.
julia> @benchmark fib(10, 0.5)
...
mean time: 26.373 ns
...
julia> p = 0.5
0.5
julia> @benchmark eval(fib_expr(10, :p))
...
mean time: 177.906 μs
...
julia> @benchmark fib_gen(Val{10}, 0.5)
...
mean time: 2.046 ns
...
I have many questions:
Why is the second case so slow?
What exactly is ::Type{Val{n}}, and what does it mean? (I copied that from the answer linked above)
Because of the JIT compiler, I am sometimes lost about what happens at compile time and what happens at run time, as is the case here...
Furthermore, I tried to combine fib_expr and fib_gen in a single function according to
@generated function fib_tot{n}(::Type{Val{n}}, p::Real)
    if n <= 1
        return :(1 * p)
    else
        return :(n * fib_tot(Val{n-1}, p))
    end
end
which however is slow
julia> @benchmark fib_tot(Val{10}, 0.5)
...
mean time: 4.601 μs
...
What am I doing wrong here? Is it even possible to combine fib_expr and fib_gen in a single function?
I realize this is more of a monograph than a question; however, even though I have read the metaprogramming section a few times, I am having a hard time grasping everything, in particular with an applied example such as this one.
A monograph in response:
Metaprogramming basics
It will be easier to start with "normal" macros first. I'll relax the definition you used a bit:
function fib_expr(n::Integer, p)
    if n <= 1
        return :(1 * $p)
    else
        return :($n * $(fib_expr(n-1, p)))
    end
end
That allows passing in more than just symbols for p, like integer literals or whole expressions. Given this, we can define a macro for the same functionality:
macro fib_macro(n::Integer, p)
    fib_expr(n, p)
end
Now, if @fib_macro 45 1 is used anywhere in the code, at compile time it will first be replaced by a long nested expression
:(45 * (44 * ... * (1 * 1)) ... )
and then compiled normally -- to a constant.
That's all there is to macros, really. Replacing syntax during compile time; and by recursion, this can be an arbitrarily long alternation between compiling and evaluating functions on expressions. And for things that are essentially constant, but tedious to write otherwise, it is very useful: a good example is Base.Math.@evalpoly.
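For instance, @evalpoly unrolls a Horner-scheme polynomial evaluation at compile time; a small illustrative use (1 + 2·2.0 + 3·2.0² = 17):
julia> @evalpoly(2.0, 1, 2, 3)   # 1 + 2x + 3x^2 at x = 2.0
17.0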
Evaluation at runtime?
But it has the problem that you cannot inspect values which are only known at runtime: you can't implement fib(n) = @fib_macro n 1, since at compile time, n is a symbol representing the parameter, and not a number you can dispatch on.
The next best solution to this would be to use
fib_eval(n::Integer) = eval(fib_expr(n, 1))
which works, but will repeat the compilation process every time it is called -- and that is much more overhead than the original function, since now at runtime, we perform the whole recursion on the expression tree and then call the compiler on the result. Not good.
Method dispatch & compilation
So we need a way to intermingle runtime and compile time. Enter @generated functions. These will at runtime dispatch on a type, and then work like a macro defining the function body.
First about type dispatch. If we have
f(x) = x + 1
and make a function call f(1), roughly the following will happen:
The type of the argument is determined (Int)
The method table of the function is consulted to find the best matching method
The method body is compiled for the specific Int argument type, if that hasn't been done before
The compiled method is evaluated on the concrete argument
If we then enter f(1.0), the same will happen again, with a new, different specialized method being compiled for Float64, based on the same function body.
Value types & singleton types
Now, Julia has the peculiar feature that you can use numbers as types. That means that the dispatch process outlined above will also work on the following function:
g(::Type{Val{N}}) where N = N + 1
That's a bit tricky. Remember that types are themselves values in Julia: Int isa Type.
Here, Val{N} is for every N a so-called singleton type having exactly one instance, namely Val{N}() -- just like Int is a type having many instances 0, -1, 1, -2, ....
Type{T} is also a singleton type, having as its single instance the type T. Int is a Type{Int}, and Val{3} is a Type{Val{3}} -- in fact, both are the only values of their type.
So, for each N, there is a type Val{N}, being the single instance of Type{Val{N}}. Thus, g will be dispatched and compiled for each single N. This is how we can dispatch on numbers as types. This already allows for optimization:
julia> @code_llvm g(Val{1})
define i64 @julia_g_61158(i8**) #0 !dbg !5 {
top:
ret i64 2
}
julia> @code_llvm f(1)
define i64 @julia_f_61076(i64) #0 !dbg !5 {
top:
%1 = shl i64 %0, 2
%2 = or i64 %1, 3
%3 = mul i64 %2, %0
%4 = add i64 %3, 2
ret i64 %4
}
But remember that it requires compilation for each new N at the first call.
(And fkt(::T) is just short for fkt(x::T) if you don't use x in the body.)
Integrating generated functions and value types
Finally to generated functions. They work as a slight modification of the above dispatch pattern:
The type of the argument is determined (Int)
The method table of the function is consulted to find the best matching method
The method body is treated as a macro and called with the Int argument type as a parameter, if that hasn't been done before. The resulting expression is compiled into a method.
The compiled method is evaluated on the concrete argument
This pattern makes it possible to change the implementation for each type on which the function is dispatched.
For our concrete setting, we want to dispatch on the Val types representing the arguments of the Fibonacci sequence:
@generated function fib_gen{n}(::Type{Val{n}}, p::Real)
    return fib_expr(n, :p)
end
You now see that your explanation was exactly right:
in the first call to fib_gen(Val{3}, 0.5), the parametric function
fib_gen{Val{3}}(...) gets compiled and its content is the fully
expanded expression obtained through fib_expr(3, :p), i.e. 3*2*1*p
with p substituted with the input value.
I hope that the whole story has also answered all three of your listed questions:
The implementation using eval replicates the recursion every time, plus the overhead of compilation
Val is a trick to lift numbers to types, and Type{T} is the singleton type containing only T -- but I hope the examples were helpful enough
Compile time is not simply "before execution"; because of the JIT, it is whenever a method gets compiled for the first time, because it gets called.
First of all, let me join the other commenters: your question is very well written & constructive.
I have reproduced your results using Julia 0.7-beta.
Difference between @generated fib_tot (one piece of code) and fib_gen (that calls fib_expr)
With my Julia version the results are identical:
julia> @btime fib_tot(Val{10},0.5)
0.042 ns (0 allocations: 0 bytes)
1.8144e6
julia> @btime fib_gen(Val{10},0.5)
0.042 ns (0 allocations: 0 bytes)
1.8144e6
Sometimes breaking a function into multiple parts (see the official doc: performance tips) can be useful; however, in your particular case I do not see why this would help. At compile time Julia has everything it needs to optimize fib_tot. There is a branch if n <= 1, but n is known at "compile time" thanks to the Type{Val{n}} trick, so this branch should be removed without problem in the generated (specialized) code.
The Type{Val{n}} trick
To specialize functions, Julia performs inference according to argument types and not according to argument values.
For instance, a compiled version of foo(n::Int) = ... is not generated for each value of n. You must define a type that depends on the value of n to reach this goal. This is precisely how Type{Val{n}} works: Val{n} is simply a parametrized empty structure:
struct Val{T} end
Hence, each Val{1}, Val{2}, ... Val{100}, ... is a different type. As a consequence, if foo is defined as:
foo(::Type{Val{n}}) where {n} = ...
Each call foo(Val{1}), foo(Val{2}), ... foo(Val{100}) will trigger a specialized foo version (because the argument types are different).
The eval(fib_expr(n, 1)) case
This
julia> @btime eval(fib_expr(10, :p))
401.651 μs (99 allocations: 6.45 KiB)
1.8144e6
is slow because the expression is (re-)compiled every time. The problem can be avoided if you use a macro instead (see phg's answer).
The fib version
julia> @btime fib(10,0.5)
30.778 ns (0 allocations: 0 bytes)
1.8144e6
There is only one compiled version of this fib function. As a consequence, it must contain all the runtime branch tests, etc. This explains why it is slower.
Just a remark about:
foo{n}(::Type{Val{n}}) deprecated syntax
The foo{n}(::Type{Val{n}}) syntax is deprecated; the new one is foo(::Type{Val{n}}) where {n}. You can read the Julia doc on parametric methods for further details.
My Julia version:
julia> versioninfo()
Julia Version 0.7.0-beta.0
Commit f41b1ecaec (2018-06-24 01:32 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: Intel(R) Xeon(R) CPU E5-2603 v3 @ 1.60GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-6.0.0 (ORCJIT, haswell)
This is the second SML program I have been working on. These functions are mutually recursive. If I call odd(1) I should get true and even(1) I should get false. These functions should work for all positive integers. However, when I run this program:
fun
odd (n) = if n=0 then false else even (n-1);
and
even (n) = if n=0 then true else odd (n-1);
I get:
[opening test.sml]
test.sml:2.35-2.39 Error: unbound variable or constructor: even
val it = () : unit
How can I fix this?
The problem is the semicolon (;) in the middle. Semicolons are allowed (optionally) at the end of a complete declaration, but right before and is not the end of a declaration!
So the compiler is blowing up on the invalid declaration fun odd (n) = if n=0 then false else even (n-1) that refers to undeclared even. If it were to proceed, it would next blow up on the illegal occurrence of and at the start of a declaration.
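The fix is therefore to drop that semicolon, so that odd and even form a single joint declaration connected by and:
fun odd (n) = if n=0 then false else even (n-1)
and even (n) = if n=0 then true else odd (n-1);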
Note that there are only two situations where a semicolon is meaningful:
the notation (...A... ; ...B... ; ...C...) means "evaluate ...A..., ...B..., and ...C..., and return the result of ...C...".
likewise the notation let ... in ...A... ; ...B... ; ...C... end, where no extra parentheses are needed because in ... end already does an adequate job of bracketing the contents.
if you're using the interactive REPL (read-evaluate-print loop), a semicolon at the end of a top-level declaration means "OK, now actually go ahead and elaborate/evaluate/etc. everything so far".
Idiomatic Standard ML doesn't really use semicolons outside of the above situations; but it's OK to do so, as long as you don't start thinking in terms of procedural languages and expecting the semicolons to "terminate statements", or anything like that. There's obviously a relationship between the use of ; in Standard ML and the use of ; in languages such as C and its syntactic descendants, but it's not a direct one.
I'm sure there's a didactic point in making these functions recursive, but here's some shorter ones:
fun even x = x mod 2 = 0
val odd = not o even
For example, below is a piece of C code and its assembly code generated by cc compiler.
// C code (pre K&R C)
foo(a, b) {
int c, d;
c = a;
d = b;
return c+d;
}
// corresponding assembly code generated by cc
.global _foo
.text
_foo:
~~foo:
~a=4
~b=6
~c=177770
~d=177766
jsr r5, csv
sub $4, sp
mov 4(r5), -10(r5)
mov 6(r5), -12(r5)
mov -10(r5), r0
add -12(r5), r0
jbr L1
L1: jmp cret
I can understand most of the code, but I don't know what ~~foo: does. And where do the magic numbers in ~c=177770 and ~d=177766 come from? The hardware is a PDP-11/40.
The tildes look like data which determines the stack usage. You might find it helpful to recall that the PDP-11 used 16-bit integers, and that DEC preferred octal numbers over hexadecimal.
That
jsr r5, csv
is a way of making register 5 (r5) point to some data (perhaps the list of offsets).
The numbers correspond to offsets on the stack in octal. The caller is assumed to do something like
push a and b onto the stack (positive offsets)
push the return address onto the stack (offset=0)
possibly push other stuff in the csv function
c and d are local variables (negative offsets, hence the "17777x")
That line
~d=177776
looks odd - I'd expect
~d=177766
since it should be below c on the stack. The -10 and -12 offsets in the register operands look like they're also octal numbers. You should be able to match up the offsets with the variables, by context.
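Taking that reading, here is a small check of just the octal arithmetic (a sketch in modern C, not part of the original code): interpreted as 16-bit two's complement, 177770 octal is -8 decimal (i.e. -10 octal) and 177766 is -10 decimal (i.e. -12 octal), matching the -10(r5) and -12(r5) operands, since the PDP-11 assembler defaults to octal.
#include <stdio.h>
#include <stdint.h>

int main(void) {
    /* 0177770 == 0xFFF8 == -8  decimal (== -010 octal): frame offset of c */
    /* 0177766 == 0xFFF6 == -10 decimal (== -012 octal): frame offset of d */
    printf("%d %d\n", (int16_t)0177770, (int16_t)0177766);  /* prints: -8 -10 */
    return 0;
}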
That's just an educated guess: I adapted the jsr+r5 idiom a while back in a text-editor.
The lines with tildes are symbol definitions. A clue for that is in the DECUS C Compiler Reference, found at
ftp://ftp.update.uu.se/pub/pdp11/rsx/lang/decusc/2.19/005003/CC.DOC
which says
3.3 Global Symbols Containing Radix-50 '$' and '.'
With this version of Decus C, it is possible to generate and
access global symbols which contain the Radix-50 '.' and '$'.
The compiler allows identifiers to contain the Ascii '$', which
becomes a Radix-50 '$' in the object code. The AS assembly code
shows this character as a tilde (~). The underscore character
(_) in a C program becomes a '.' in both the AS assembly
language and in the object code. This allows C programs to
access all global symbols:
extern int $dsw;
. . .
printf("Directive status = %06o\n", $dsw);
The above prints the current contents of the task's directive
status word.
So you could read
~a=4
as
$a=4
and see that $a is a (more or less) conventional symbol.
For example for:
type PERSONCV is
record
name: String ( 1..4 );
age: Integer;
cvtext: String ( 1..2000 );
end record;
N: constant := 40000;
persons : array ( 1..N ) of PERSONCV;
function jobcandidate return Boolean is
iscandidate: Boolean := False;
begin
for p of persons loop -- what code is generated for this?
if p.age >= 18 then
iscandidate := true;
exit;
end if;
end loop;
return iscandidate;
end;
In C/C++ the loop part would typically be:
PERSONCV * p; // address pointer
int k = 0;
while ( k < N )
{
p = &persons [ k ]; // pointer to k'th record
if ( p->age >= 18 )...
...
k++ ;
}
I have read that Ada uses Value semantics for records.
Does the Ada loop above copy the k'th record to loop variable p?
e.g. like this is in C/C++ :
PERSONCV p; // object/variable
int k = 0;
while ( k < N )
{
memcpy ( &p, &persons [ k ], sizeof ( PERSONCV ) ); // copies k'th elem
if ( p.age >= 18 )...
...
k++ ;
}
Assuming you are using GNAT, there are two avenues of investigation.
The switch -gnatG will regenerate an Ada-like representation of what the front end of the compiler is going to pass to the back end (before any optimisations). In this case, I see
function resander__jobcandidate return boolean is
iscandidate : boolean := false;
L_1 : label
begin
L_1 : for C8b in 1 .. 40000 loop
p : resander__personcv renames resander__persons (C8b);
if p.age >= 18 then
iscandidate := true;
exit;
end if;
end loop L_1;
return iscandidate;
end resander__jobcandidate;
so the question is, how does renames get translated? Given that the record size is 2008 bytes, the chances of the compiler generating a copy are pretty much zero.
The second investigative approach is to keep the assembly code that the compiler normally emits to the assembler and then deletes, using the switch -S. This confirms that the generated code is like your first C++ version (for macOS).
As an interesting sidelight, Ada 2012 allows an alternate implementation of jobcandidate:
function jobcandidate2 return Boolean is
(for some p of persons => p.age >= 18);
which generates identical code.
I suspect that what you have read about Ada is wrong, and probably worse, is encouraging you to think about Ada in the wrong way.
Ada's intent is to encourage thinking in the problem domain, i.e. to specify what should happen, rather than thinking in the solution domain, i.e. to implement the fine details of exactly how.
So here the intent is to loop over all Persons, exit returning True on meeting the first over 18, otherwise return False.
And that's it.
By and large Ada mandates nothing about the details of how it's done, provided those semantics are satisfied.
Then, the intent is, you just expect the compiler to do the right thing.
Now an individual compiler may choose one implementation over another - or may switch between implementations according to optimisation heuristics, taking into account which CPU it's compiling for, as well as the size of the objects (will they fit into a register?) etc.
You could imagine a CPU with many registers, where a single cache line read makes the copy implementation faster than operating in place (especially if there are no modifications to write back to P's contents), or other target CPUs where the reverse was true. Why would you want to stop the compiler picking the better implementation?
A good example of this is Ada's approach to parameter passing to subprograms - name, value or reference semantics really don't apply - instead, you specify the parameter passing mode - in, out, or in out describing the information flow to (or from) the subprogram. Intuitive, provides semantics that can be more rigorously checked, and leaves the compiler free to pick the best (fastest, smallest, depending on your goal) implementation that correctly obeys those semantics.
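For illustration, a minimal sketch reusing the PERSONCV record from the question (the subprograms themselves are hypothetical, not from the original post); the modes only state the direction of data flow, and the compiler remains free to pass the record by copy or by reference:
-- "in out": the caller's record is read and updated.
procedure Birthday (P : in out PERSONCV) is
begin
   P.age := P.age + 1;
end Birthday;

-- "in": the record is only read.
function Is_Adult (P : in PERSONCV) return Boolean is
begin
   return P.age >= 18;
end Is_Adult;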
Now it would be possible for a specific Ada compiler to make poor choices, and 30 years ago when computers were barely big enough to run an Ada compiler at all, you might well have found performance compromised for simplicity in early releases of a compiler.
But we have thirty more years of compiler development now, running on more powerful computers. So, today, I'd expect the compiler to normally make the best choice. And if you find a specific compiler missing out on performance optimisations, file an enhancement request. Ada compilers aren't perfect, just like any other compiler.
In this specific example, I'd normally expect P to be a cursor into the array, and operations to happen in-place, i.e. reference semantics. Or possibly a hybrid between forms, where one memory fetch into a register serves several operations, like a partial form of value semantics.
If your interest is academic, you can easily look at the assembly output from whatever compiler you're using and find out. Or write all three versions above and benchmark them.
Using a current compiler (GCC 7.0.0), I have copied your source to both an Ada program and a C++ program, using std::array<char, 4> etc. corresponding to String ( 1..4 ) etc. Switches were simply -O2 for C++, and -O2 -gnatp for Ada, so as to use comparable settings regarding checked access to array elements, etc.
These are the results for jobcandidate:
C++:
movl $_ZN15Loop_Over_Array7personsE+4, %eax
movl $_ZN15Loop_Over_Array7personsE+80320004, %edx
jmp .L3
.L8:
addq $2008, %rax
cmpq %rdx, %rax
je .L7
.L3:
cmpl $17, (%rax)
jle .L8
movl $1, %eax
ret
.L7:
xorl %eax, %eax
ret
Ada:
movl $1, %eax
jmp .L5
.L10:
addq $1, %rax
cmpq $40001, %rax
je .L9
.L5:
imulq $2008, %rax, %rdx
cmpl $17, loop_over_array__persons-2004(%rdx)
jle .L10
movl $1, %eax
ret
.L9:
xorl %eax, %eax
ret
One difference I see is in how either implementation uses %edx and %eax for going from one element of the array to the next and for testing whether the end has been reached. Ada seems to imulq the element size to compute the cursor, while C++ seems to addq it to the pointer.
I haven't measured performance.
I've seen that in libc.so the actual strcmp_sse* variant to call is decided by the function strcmp itself.
Here is the code:
strcmp:
.text:000000000007B9F0 cmp cs:__cpu_features.kind, 0
.text:000000000007B9F7 jnz short loc_7B9FE
.text:000000000007B9F9 call __init_cpu_features
.text:000000000007B9FE
.text:000000000007B9FE loc_7B9FE: ; CODE XREF: .text:000000000007B9F7j
.text:000000000007B9FE lea rax, __strcmp_sse2_unaligned
.text:000000000007BA05 test cs:__cpu_features.cpuid._eax, 10h
.text:000000000007BA0F jnz short locret_7BA2B
.text:000000000007BA11 lea rax, __strcmp_ssse3
.text:000000000007BA18 test cs:__cpu_features.cpuid._ecx, 200h
.text:000000000007BA22 jnz short locret_7BA2B
.text:000000000007BA24 lea rax, __strcmp_sse2
.text:000000000007BA2B
.text:000000000007BA2B locret_7BA2B: ; CODE XREF: .text:000000000007BA0Fj
.text:000000000007BA2B ; .text:000000000007BA22j
.text:000000000007BA2B retn
What I do not understand is that the address of the strcmp_sse function to call is placed in rax but never actually called. Therefore I am wondering: who is going to call *rax? And when?
The Linux dynamic linker supports a special symbol type called STT_GNU_IFUNC. strcmp is likely implemented as an IFUNC. 'Regular' symbols in a dynamic library are nothing more than a mapping from a name to an address. IFUNCs are a bit more complex than that: the address isn't readily available; in order to obtain it, the linker must execute a piece of code from the library itself. We are seeing an example of such a piece of code here. Note that in the x86_64 ABI a function returns its result in RAX.
This technique is typically used to pick the optimal implementation based on the CPU features. Please note that the selection logic runs only once; all but the first call to strcmp are fast.
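For the curious, here is a minimal sketch of how a library can define such an IFUNC with GCC's ifunc attribute; the names are illustrative, and this is not glibc's actual code:
typedef int (*strcmp_fn)(const char *, const char *);

static int my_strcmp_generic(const char *a, const char *b) {
    /* plain byte-by-byte comparison */
    while (*a != '\0' && *a == *b) { a++; b++; }
    return (unsigned char)*a - (unsigned char)*b;
}

static int my_strcmp_ssse3(const char *a, const char *b) {
    /* stand-in for a vectorized variant */
    return my_strcmp_generic(a, b);
}

/* The resolver runs once, at relocation time, and returns the address of
   the implementation to use -- the "result in RAX" pattern seen above. */
static strcmp_fn resolve_my_strcmp(void) {
    __builtin_cpu_init();
    return __builtin_cpu_supports("ssse3") ? my_strcmp_ssse3
                                           : my_strcmp_generic;
}

/* Callers see an ordinary function; the dynamic linker fills the PLT/GOT
   entry with whatever address the resolver returned. */
int my_strcmp(const char *a, const char *b)
    __attribute__((ifunc("resolve_my_strcmp")));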