function c1()
    x::UInt64 = 0
    while x <= (10^8 * 10)
        x += 1
    end
end

function c2()
    x::UInt64 = 0
    while x <= (10^9)
        x += 1
    end
end

function c3()
    x::UInt64 = 0
    y::UInt64 = 10^8 * 10
    while x <= y
        x += 1
    end
end

function c4()
    x::UInt64 = 0
    y::UInt64 = 10^9
    while x <= y
        x += 1
    end
end
Should be the same, right?
@time c1()
0.019102 seconds (40.99 k allocations: 2.313 MiB)
@time c1()
0.000003 seconds (4 allocations: 160 bytes)
@time c2()
9.205925 seconds (47.89 k allocations: 2.750 MiB)
@time c2()
9.015212 seconds (4 allocations: 160 bytes)
@time c3()
0.019848 seconds (39.23 k allocations: 2.205 MiB)
@time c3()
0.000003 seconds (4 allocations: 160 bytes)
@time c4()
0.705712 seconds (47.41 k allocations: 2.719 MiB)
@time c4()
0.760354 seconds (4 allocations: 160 bytes)
This is about Julia's compile-time optimization of literal integer powers using power-by-squaring. Julia is able to optimize the power if the exponent is 0, 1, 2, or 3, or can be reached solely via power-by-squaring. I believe this is done by lowering x^p to x^Val{p} for literal integer p and using compiler specialization (or inlining plus a kind of metaprogramming; I am not sure what the correct term is, but it's the sort of thing you would find in Lisp, and similar techniques are used for source-to-source auto-differentiation in Julia, see Zygote.jl) to lower the code to a constant when p is 0, 1, 2, 3 or a power of 2.
Julia lowers 10^8 to an inlined literal_pow (and then power_by_squaring), which gets lowered to a constant; Julia then folds constant * 10 into another constant, realizes the whole while loop is unnecessary, and removes it, all at compile time.
If you change 10^8 to 10^7 in c1, you will see that it evaluates the number and the loop at runtime. However, if you replace 10^8 with 10^4 or 10^2, all of the computation is handled at compile time. I don't think Julia is specifically set up to compile-time optimize when the exponent is a power of 2; rather, the compiler just turns out to be able to optimize the code (lowering it to a constant) in that case.
The case in which p is 1, 2 or 3 is hard-coded in Julia. It is optimized, again, by lowering the code to an inlined version of literal_pow and then compiler specialization.
You can use the @code_llvm and @code_native macros to see what's going on. Let's try.
julia> f() = 10^8*10
julia> g() = 10^7*10
julia> @code_native f()
        .text
; Function f {
; Location: In[101]:2
        movl    $1000000000, %eax       # imm = 0x3B9ACA00
        retq
        nopw    %cs:(%rax,%rax)
;}
julia> @code_native g()
        .text
; Function g {
; Location: In[104]:1
; Function literal_pow; {
; Location: none
; Function macro expansion; {
; Location: none
; Function ^; {
; Location: In[104]:1
        pushq   %rax
        movabsq $power_by_squaring, %rax
        movl    $10, %edi
        movl    $7, %esi
        callq   *%rax
;}}}
; Function *; {
; Location: int.jl:54
        addq    %rax, %rax
        leaq    (%rax,%rax,4), %rax
;}
        popq    %rcx
        retq
;}
As you can see, f() turns out to be just a constant, while g() evaluates things at run time.
I think Julia started this integer exponentiation trick around this commit, if you would like to dig more.
EDIT: Let's compile-time optimize c2
I have also prepared a function to compute integer-integer exponents, with which Julia will also optimize non-power-of-2 exponents. I am not sure it is correct in all cases, though.
@inline function ipow(base::Int, exp::Int)
    result = 1
    flag = true
    while flag
        if (exp & 1) > 0
            result *= base
        end
        exp >>= 1
        base *= base
        flag = exp != 0
    end
    return result
end
Now replace your 10^9 in c2 with ipow(10,9), and enjoy the power of compile-time optimization.
Also see this question for power-by-squaring.
Please don't use this function as is, since it tries to inline all the exponentiation, whether or not it consists of literals. You wouldn't want that.
2nd UPDATE: Check out hckr's answer. It's much better than mine.
UPDATE: This is not a comprehensive answer. Just as much as I've been able to puzzle out, and I've had to give up for now due to time constraints.
I'm probably not the best person to answer this question, since as far as compiler optimization is concerned, I know just enough to be dangerous. Hopefully someone who understands Julia's compiler a little better stumbles across this question and can give a more comprehensive response, because from what I can see, your c2 function is doing an awful lot of work that it shouldn't need to.
So, there are at least two issues at play here. Firstly, as it stands, both c1 and c2 will always return nothing. For some reason I don't understand, the compiler is able to work this out in the case of c1, but not in the case of c2. Consequently, after compilation, c1 runs almost instantly because the loop in the algorithm is never actually performed. Indeed:
julia> @btime c1()
1.535 ns (0 allocations: 0 bytes)
You can also see this using @code_native c1() and @code_native c2(). The former is only a couple of lines long, while the latter contains many more instructions. It is also worth noting that the former contains no reference to the function <=, indicating that the condition in the while loop has been completely optimized out.
We can deal with this first issue by adding a return x statement at the bottom of both your functions, which forces the compiler to actually engage with the question of what the final value of x will be.
However, if you do this, you'll note that c1 is still about 10 times faster than c2, which is the second puzzling issue about your example.
It seems to me that even with a return x, a sufficiently clever compiler has all the information it should need to skip the loop entirely. That is, it knows at compile time the start value of x, the exact value of the transformation inside the loop, and the exact value of the terminating condition. Surprisingly enough, if you run @code_native c1() (after adding return x at the bottom), you'll notice that it has indeed worked out the function return value right there in the native code (cmpq $1000000001):
julia> @code_native c1()
        .text
; Function c1 {
; Location: REPL[2]:2
        movq    $-1, %rax
        nopw    (%rax,%rax)
; Location: REPL[2]:3
; Function <=; {
; Location: int.jl:436
; Function <=; {
; Location: int.jl:429
L16:
        addq    $1, %rax
        cmpq    $1000000001, %rax       # imm = 0x3B9ACA01
;}}
        jb      L16
; Location: REPL[2]:6
        retq
        nopl    (%rax)
;}
so I'm not really sure why it is still doing any work at all!
For reference, here is the output of @code_native c2() (after adding return x):
julia> @code_native c2()
        .text
; Function c2 {
; Location: REPL[3]:2
        pushq   %r14
        pushq   %rbx
        pushq   %rax
        movq    $-1, %rbx
        movabsq $power_by_squaring, %r14
        nopw    %cs:(%rax,%rax)
; Location: REPL[3]:3
; Function literal_pow; {
; Location: none
; Function macro expansion; {
; Location: none
; Function ^; {
; Location: intfuncs.jl:220
L32:
        addq    $1, %rbx
        movl    $10, %edi
        movl    $9, %esi
        callq   *%r14
;}}}
; Function <=; {
; Location: int.jl:436
; Function >=; {
; Location: operators.jl:333
; Function <=; {
; Location: int.jl:428
        testq   %rax, %rax
;}}}
        js      L59
        cmpq    %rax, %rbx
        jbe     L32
; Location: REPL[3]:6
L59:
        movq    %rbx, %rax
        addq    $8, %rsp
        popq    %rbx
        popq    %r14
        retq
        nopw    %cs:(%rax,%rax)
;}
There is clearly a lot of additional work going on here for c2 that doesn't make much sense to me. Hopefully someone more familiar with Julia's internals can shed some light on this.
Related
I am a beginner in assembly and I have a simple question.
This is my code:
BITS 64                 ; 64-bit mode
global strchr           ; Export 'strchr'
SECTION .text           ; Code section

strchr:
    mov rcx, -1
.loop:
    inc rcx
    cmp byte [rdi+rcx], 0
    je exit_null
    cmp byte [rdi+rcx], sil
    jne .loop
    mov rax, [rdi+rcx]
    ret
exit_null:
    mov rax, 0
    ret
This compiles but doesn't work. I want to reproduce the function strchr, as you can see. When I test my function with a printf it crashes (the problem isn't the test).
I know I can inc rdi directly to step through the string in rdi and return it at the position I want.
But I just want to know if there is a way to return rdi at the position rcx, to fix my code and probably improve it.
Your function strchr seems to expect two parameters:
a pointer to the string in RDI, and
the character to search for in RSI (compared via its low byte, SIL).
Register rcx is used as an index inside the string? In that case you should use al instead of cl. Be aware that you don't limit the search size. When the character given in RSI is not found in the string, the search will probably trigger an exception. Perhaps you should test al loaded from [rdi+rcx] and quit further searching when al=0.
If you want it to return a pointer to the first occurrence of the character inside the string, just replace mov rax,[rdi+rcx] with lea rax,[rdi+rcx].
Your code (from edit Version 2) does the following:
char* strchr ( char *p, char x ) {
    int i = -1;
    do {
        if ( p[i] == '\0' ) return NULL;
        i++;
    } while ( p[i] != x );
    return * (long long*) &(p[i]);
}
As @vitsoft says, your intention is to return a pointer, but the first return (in assembly) returns a single quad word loaded from the address of the found character: 8 characters instead of an address.
It is unusual to increment in the middle of the loop. It is also odd to start the index at -1. On the first iteration, the loop continue condition looks at p[-1], which is not a good idea, since that's not part of the string you're being asked to search. If that byte happens to be the nul character, it'll stop the search right there.
If you waited to increment until both tests are performed, then you would not be referencing p[-1], and you could also start the index at 0, which would be more usual.
You might consider capturing the character into a register instead of using a complex addressing mode three times.
Further, you could advance the pointer in rdi and forgo the index variable altogether.
Here's that in C:
char* strchr ( char *p, char x ) {
    for (;;) {
        char c = *p;
        if ( c == '\0' )
            break;
        if ( c == x )
            return p;
        p++;
    }
    return NULL;
}
Thanks to your help, I finally did it!
Thanks to Erik's answer, I fixed a stupid mistake: I was comparing str[-1] to the NUL terminator, which caused the error.
And following vitsoft's answer, I switched mov to lea and it worked!
Here is my code:
strchr:
    mov rcx, -1
.loop:
    inc rcx
    cmp byte [rdi+rcx], 0
    je exit_null
    cmp byte [rdi+rcx], sil
    jne .loop
    lea rax, [rdi+rcx]
    ret
exit_null:
    mov rax, 0
    ret
The only bug remaining in the current version is loading 8 bytes of char data as the return value instead of just doing pointer math, using mov instead of lea. (After various edits removed and added different bugs, as reflected in different answers talking about different code).
But this is over-complicated as well as inefficient (two loads, and indexed addressing modes, and of course extra instructions to set up RCX).
Just increment the pointer since that's what you want to return anyway.
If you're going to loop 1 byte at a time instead of using SSE2 to check 16 bytes at once, strchr can be as simple as:
;; BITS 64 is useless unless you're writing a kernel with a mix of 32 and 64-bit code
;; otherwise it only lets you shoot yourself in the foot by putting 64-bit machine code in a 32-bit object file by accident.
global mystrchr
mystrchr:
.loop:                            ; do {
    movzx   ecx, byte [rdi]       ;     c = *p;
    cmp     cl, sil               ;     if (c == needle) return p;
    je      .found
    inc     rdi                   ;     p++
    test    cl, cl
    jnz     .loop                 ; } while (c != 0)
;; fell out of the loop on hitting the 0 terminator without finding a match
    xor     edi, edi              ; p = NULL
    ; optionally an extra ret here, or just fall through
.found:
    mov     rax, rdi              ; return p
    ret
I checked for a match before end-of-string so I'd still have the un-incremented pointer, and not have to decrement it in the "found" return path. If I started the loop with inc, I could use an [rdi - 1] addressing mode, still avoiding a separate counter. That's why I switched up the order of which branch was at the bottom of the loop vs. your code in the question.
Since we want to compare the character twice, against SIL and against zero, I loaded it into a register. This might not run any faster on modern x86-64 which can run 2 loads per clock as well as 2 branches (as long as at most one of them is taken).
Some Intel CPUs can micro-fuse and macro-fuse cmp reg,mem / jcc into a single load+compare-and-branch uop for the front-end, at least when the memory addressing mode is simple, not indexed. But not cmp [mem], imm/jcc, so we're not costing any extra uops for the front-end on Intel CPUs by separately loading into a register. (With movzx to avoid a false dependency from writing a partial register like mov cl, [rdi])
Note that if your caller is also written in assembly, it's easy to return multiple values, e.g. a status and a pointer (in the not-found case, a pointer to the terminating 0 would perhaps be useful). Many C standard library string functions, notably strcpy, are badly designed in that they don't help the caller avoid redoing length-finding work.
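As a hedged aside in C, here is a sketch of that "return something more useful" point: a strcpy-style copy that returns the end of the destination (the idea behind POSIX stpcpy), so the caller never has to re-run strlen to append more data. The name copy_end is made up for illustration.

/* Copy src (including its terminating 0) into dst and return a pointer to
 * the copied terminator, so the caller can keep appending from there. */
char *copy_end(char *dst, const char *src)
{
    while ((*dst = *src++) != '\0')
        dst++;
    return dst;            /* points at the NUL just written into dst */
}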
Especially on modern CPUs with SIMD, explicit lengths are quite useful to have: a real-world strchr implementation would check alignment, or check that the given pointer isn't within 16 bytes of the end of a page. But memchr doesn't have to, if the size is >= 16: it could just do a movdqu load and pcmpeqb.
See Is it safe to read past the end of a buffer within the same page on x86 and x64? for details and a link to glibc strlen's hand-written asm. Also Find the first instance of a character using simd for real-world implementations like glibc's using pcmpeqb / pmovmskb. (And maybe pminub for the 0-terminator check to unroll over multiple vectors.)
SSE2 can go about 16x faster than the code in this answer for non-tiny strings. For very large strings, you might hit a memory bottleneck and "only" be about 8x faster.
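To make the SIMD idea above concrete, here is a hedged C-with-intrinsics sketch of the pcmpeqb / pmovmskb approach. It is a memchr-style routine with an explicit length (the name memchr_sse2 is mine), so it can ignore the alignment and page-crossing concerns a real strchr has to deal with; treat it as a sketch, not a tuned implementation.

#include <emmintrin.h>   /* SSE2 intrinsics: movdqu / pcmpeqb / pmovmskb */
#include <stddef.h>

const char *memchr_sse2(const char *p, char c, size_t n)
{
    __m128i needle = _mm_set1_epi8(c);                   /* broadcast c into all 16 lanes */
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)(p + i));    /* movdqu  */
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, needle));  /* pcmpeqb + pmovmskb */
        if (mask)                                        /* some byte in this chunk matched */
            return p + i + __builtin_ctz(mask);          /* lowest set bit = first match (GCC/clang builtin) */
    }
    for (; i < n; i++)                                   /* scalar tail for the last <16 bytes */
        if (p[i] == c)
            return p + i;
    return NULL;
}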
This is more an academic exercise than anything else, but I'm looking to write a recursive function in assembly that, if it receives an "interrupt signal", returns to the main function, and not just to the function that invoked it (which is usually the same recursive function).
For this test, I'm doing a basic countdown and printing one-character digits (8...7...6...etc.). To simulate an "interrupt", I am using the number 7, so when the function hits 7 (if it starts above that), it will return a 1, meaning it was interrupted; if it wasn't interrupted, it'll count down to zero. Here is what I have thus far:
.globl _start

_start:
    # countdown(8);
    mov     $8, %rdi
    call    countdown
    # return 0;
    mov     %eax, %edi
    mov     $60, %eax
    syscall

print:
    push    %rbp
    mov     %rsp, %rbp
    # write the value to a memory location
    pushq   %rdi                # now 16-byte aligned
    addb    $'0', -8(%rbp)
    movb    $'\n', -7(%rbp)
    # do a write syscall
    mov     $1, %rax            # linux syscall write
    mov     $1, %rdi            # file descriptor: stdout=1
    lea     -8(%rbp), %rsi      # memory location of string goes in rsi
    mov     $2, %rdx            # length: 1 char + newline
    syscall
    # restore the stack
    pop     %rdi
    pop     %rbp
    ret

countdown:
    # this is the handler to call the recursive function so it can
    # pass the address to jump back to in an interrupt as one of the
    # function parameters
    # (%rsp) currently holds the return address, and let's pass that as the second argument
    mov     %rdi, %rdi          # redundant, but for clarity
    mov     (%rsp), %rsi        # return address to jump to
    call    countdown_recursive

countdown_recursive:
    # bool countdown(int n: n<10, return_address)
    # ...{
    push    %rbp
    mov     %rsp, %rbp
    # if (num<0) ... return
    cmp     $0, %rdi
    jz      end
    # imaginary interrupt on num=7
    cmp     $7, %rdi
    jz      fast_ret
    # else...printf("%d\n", num);
    push    %rsi
    push    %rdi
    call    print
    pop     %rdi
    pop     %rsi
    # --num
    dec     %rdi
    # countdown(num)
    call    countdown_recursive
end:
    # ...}
    mov     $0, %eax
    mov     %rbp, %rsp
    pop     %rbp
    ret
fast_ret:
    mov     $1, %eax
    jmp     *%rsi
Does the above look like a valid approach, passing the memory address I want to go back to in rsi? The function was incredibly tricky for me to write, but I think mainly due to the fact that I'm pretty new/raw with assembly.
As well as returning to this alternate return address, you also need to restore the caller's (call-preserved) registers, not just the ones of your most recent parent. That includes RSP.
You're basically trying to re-invent C's setjmp / longjmp, which does exactly this, including resetting the stack pointer back to the scope where you called setjmp. I think a few of the questions in SO's setjmp tag are about implementing your own setjmp / longjmp in asm.
Also, to make this more efficient you might want to use a custom calling convention where the return address pointer (or a jmpbuf pointer after implementing the above) is in a call-preserved register like R15, so you don't have to save/restore it around print calls inside the body of your recursive function.
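For reference, here is a minimal C sketch of the setjmp / longjmp pattern being re-invented here, applied to the same countdown-with-interrupt idea; the names (interrupt_point, countdown) are mine for illustration, not taken from the asm above.

#include <setjmp.h>
#include <stdio.h>

static jmp_buf interrupt_point;

static void countdown(int n)
{
    if (n == 0)
        return;                       /* normal termination */
    if (n == 7)                       /* the "imaginary interrupt" */
        longjmp(interrupt_point, 1);  /* unwind straight back to main, skipping every frame */
    printf("%d\n", n);
    countdown(n - 1);
}

int main(void)
{
    if (setjmp(interrupt_point) == 0) {   /* 0 on the initial call */
        countdown(8);
        puts("finished normally");        /* plays the role of "returned 0" */
    } else {                              /* nonzero after longjmp fires */
        puts("interrupted");              /* plays the role of "returned 1" */
    }
    return 0;
}

longjmp restores the stack pointer and the call-preserved registers saved by setjmp, which is exactly the bookkeeping the answer above says a hand-rolled version must do.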
C language
In the C programming language, it's easy to have tail recursion:
int foo(...) {
    return foo(...);
}
Just return the return value of the recursive call as is. This is especially important when the recursion may repeat a thousand or even a million times; otherwise it would use a lot of memory on the stack.
Rust
Now, I have a Rust function that might recursively call itself a million times:
fn read_all(input: &mut dyn std::io::Read) -> std::io::Result<()> {
    match input.read(&mut [0u8]) {
        Ok ( 0) => Ok(()),
        Ok ( _) => read_all(input),
        Err(err) => Err(err),
    }
}
(this is a minimal example, the real one is more complex, but it captures the main idea)
Here, the return value of the recursive call is returned as is, but:
Does it guarantee that the Rust compiler will apply a tail recursion?
For instance, if we declare some variable that needs to be destroyed like a std::Vec, will it be destroyed just before the recursive call (which allows for tail recursion) or after the recursive call returns (which forbids the tail recursion)?
Shepmaster's answer explains that tail call elimination is merely an optimization, not a guarantee, in Rust. But "never guaranteed" doesn't mean "never happens". Let's take a look at what the compiler does with some real code.
Does it happen in this function?
As of right now, the latest release of Rust available on Compiler Explorer is 1.39, and it does not eliminate the tail call in read_all.
example::read_all:
        push    r15
        push    r14
        push    rbx
        sub     rsp, 32
        mov     r14, rdx
        mov     r15, rsi
        mov     rbx, rdi
        mov     byte ptr [rsp + 7], 0
        lea     rdi, [rsp + 8]
        lea     rdx, [rsp + 7]
        mov     ecx, 1
        call    qword ptr [r14 + 24]
        cmp     qword ptr [rsp + 8], 1
        jne     .LBB3_1
        movups  xmm0, xmmword ptr [rsp + 16]
        movups  xmmword ptr [rbx], xmm0
        jmp     .LBB3_3
.LBB3_1:
        cmp     qword ptr [rsp + 16], 0
        je      .LBB3_2
        mov     rdi, rbx
        mov     rsi, r15
        mov     rdx, r14
        call    qword ptr [rip + example::read_all@GOTPCREL]
        jmp     .LBB3_3
.LBB3_2:
        mov     byte ptr [rbx], 3
.LBB3_3:
        mov     rax, rbx
        add     rsp, 32
        pop     rbx
        pop     r14
        pop     r15
        ret
        mov     rbx, rax
        lea     rdi, [rsp + 8]
        call    core::ptr::real_drop_in_place
        mov     rdi, rbx
        call    _Unwind_Resume@PLT
        ud2
Notice this line: call qword ptr [rip + example::read_all@GOTPCREL]. That's the (tail) recursive call. As you can tell from its existence, it was not eliminated.
Compare this to an equivalent function with an explicit loop:
pub fn read_all(input: &mut dyn std::io::Read) -> std::io::Result<()> {
    loop {
        match input.read(&mut [0u8]) {
            Ok ( 0) => return Ok(()),
            Ok ( _) => continue,
            Err(err) => return Err(err),
        }
    }
}
which has no tail call to eliminate, and therefore compiles to a function with only one call in it (to the computed address of input.read).
Oh well. Maybe Rust isn't as good as C. Or is it?
Does it happen in C?
Here's a tail-recursive function in C that performs a very similar task:
int read_all(FILE *input) {
    char buf[] = {0, 0};
    if (!fgets(buf, sizeof buf, input))
        return feof(input);
    return read_all(input);
}
This should be super easy for the compiler to eliminate. The recursive call is right at the bottom of the function and C doesn't have to worry about running destructors. But nevertheless, there's that recursive tail call, annoyingly not eliminated:
call read_all
Tail call optimization is not guaranteed to happen in C, either. No compiler I tried would be convinced to turn this into a loop on its own initiative.
Since version 13, clang supports a non-standard musttail attribute you can add to tail calls that should be eliminated. Adding this attribute to the C code successfully eliminates the tail call. However, rustc currently has no equivalent attribute (although the become keyword is reserved for this purpose).
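For illustration, here is a hedged sketch of where that attribute goes in the C read_all shown above. With clang 13+, musttail makes the compiler either turn the recursion into a jump or refuse to compile, rather than silently leaving a call in place.

#include <stdio.h>

int read_all(FILE *input) {
    char buf[] = {0, 0};
    if (!fgets(buf, sizeof buf, input))
        return feof(input);
    /* clang-specific: the call below must be compiled as a tail call,
     * or compilation fails. */
    __attribute__((musttail)) return read_all(input);
}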
Does it ever happen in Rust?
Okay, so it's not guaranteed. Can the compiler do it at all? Yes! Here's a function that computes Fibonacci numbers via a tail-recursive inner function:
pub fn fibonacci(n: u64) -> u64 {
    fn f(n: u64, a: u64, b: u64) -> u64 {
        match n {
            0 => a,
            _ => f(n - 1, a + b, a),
        }
    }
    f(n, 0, 1)
}
Not only is the tail call eliminated, the whole inner function f is inlined into fibonacci, yielding only 12 instructions (and not a call in sight):
example::fibonacci:
        push    1
        pop     rdx
        xor     ecx, ecx
.LBB0_1:
        mov     rax, rdx
        test    rdi, rdi
        je      .LBB0_3
        dec     rdi
        add     rcx, rax
        mov     rdx, rcx
        mov     rcx, rax
        jmp     .LBB0_1
.LBB0_3:
        ret
If you compare this to an equivalent while loop, the compiler generates almost the same assembly.
What's the point?
You probably shouldn't be relying on optimizations to eliminate tail calls, either in Rust or in C. It's nice when it happens, but if you need to be sure that a function compiles into a tight loop, the surest way, at least for now, is to use a loop.
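For completeness, here is a sketch of what "just use a loop" looks like for the C read_all shown earlier; the name read_all_loop is mine. There is nothing here for the optimizer to prove, so the tight loop is guaranteed by construction.

#include <stdio.h>

int read_all_loop(FILE *input) {
    char buf[] = {0, 0};
    while (fgets(buf, sizeof buf, input))
        ;                        /* keep reading until fgets fails */
    return feof(input);          /* nonzero at end-of-file, 0 if it stopped on an error */
}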
Neither tail recursion (reusing a stack frame for a tail call to the same function) nor tail call optimization (reusing the stack frame for a tail call to any function) is ever guaranteed by Rust, although the optimizer may choose to perform them.
if we declare some variable that needs to be destroyed
It's my understanding that this is one of the sticking points, as changing the location of destroyed stack variables would be contentious.
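A hedged C analogue of that sticking point: if any cleanup has to run after the recursive call returns (here a free, standing in for a Rust destructor), the call is no longer in tail position at all, so there is nothing to eliminate. The function read_all_buffered is invented for illustration.

#include <stdio.h>
#include <stdlib.h>

int read_all_buffered(FILE *input) {
    char *buf = malloc(4096);
    if (!buf)
        return 0;                /* treat allocation failure as "not EOF" in this sketch */
    int r;
    if (fread(buf, 1, 4096, input) == 0)
        r = feof(input);
    else
        r = read_all_buffered(input);
    free(buf);                   /* must run after the recursive call returns, so that
                                    call is not a tail call and cannot be eliminated */
    return r;
}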
See also:
Recursive function calculating factorials leads to stack overflow
RFC 81: guaranteed tail call elimination
RFC 1888: Proper tail calls
The C and assembly code:
long pcount_r(unsigned long x) {
    if (x == 0) return 0;
    else return (x & 1) + pcount_r(x >> 1);
}
pcount_r:
    movl    $0, %eax
    testq   %rdi, %rdi
    je      .L6
    pushq   %rbx
    movq    %rdi, %rbx
    andl    $1, %ebx
    shrq    %rdi
    call    pcount_r
    addq    %rbx, %rax
    popq    %rbx
.L6:
    rep; ret
Say we pass the function the value x = 5 (binary representation 0101). Register %rdi holds the x value, and %rbx gets a copy of it via "movq %rdi, %rbx" before "x>>1" is passed to the next recursive call. Therefore %rbx holds 0101, then 010, then 01.
The first call of pcount_r:
%rdi %rbx
0101 --> 0101
The second call of pcount_r:
%rdi %rbx
010 --> 010
The third call, and so on...
%rbx can only hold one value at a time. To me, it seems we are just overwriting the previous value of %rbx rather than saving data on a stack. My question is: how can the previous %rbx value be restored when a recursive call ends?
More details about this function can be found at this link:
Recursive Procedures at time 55:00
I see. My mistake was thinking that register %rbx holds the x value throughout. In fact, %rbx only holds the value temporarily: the instruction "pushq %rbx" pushes the current %rbx value onto the stack before the next %rdi is copied into %rbx, and the value is later restored by "popq %rbx".
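To restate that correspondence, here is the same C function again with comments (mine, as a sketch) mapping each step to the assembly: the pushq gives every recursion depth its own saved copy of %rbx on the stack, and the matching popq restores it after the deeper calls return.

long pcount_r(unsigned long x) {
    if (x == 0)                       /* testq %rdi, %rdi ; je .L6               */
        return 0;                     /* movl $0, %eax                           */
    unsigned long bit = x & 1;        /* pushq %rbx saves the caller's copy,     */
                                      /* then movq %rdi, %rbx ; andl $1, %ebx    */
    long rest = pcount_r(x >> 1);     /* shrq %rdi ; call pcount_r               */
    return bit + rest;                /* addq %rbx, %rax ; popq %rbx restores    */
}                                     /* the copy saved at this depth            */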
I am re-writing an encryption/compression library, and it seems to involve a lot of processing per byte processed. I would prefer to use an enumeration type when choosing which of several limited ways the encryption can go (the proper way), but when those paths become cyclical, I have to add extra code to test for type'last and type'first. I can always write such a condition in for the type, or override the addition/subtraction operators on the type with functions that wrap the result around, but that is more code and processing, and it adds up quickly when it has to run every eight bytes along with everything else. Is there a way to make the operation about as efficient as if it were a simple "mod" type, like
type Modular is mod 64 ....;
for ......;
pragma ....;
type Frequency_Counter is array(Modular) of Long_Integer;
Head : Modular := (others => 0);
Freq : Frequency_Counter(Size) := (others => 0);
Encryption_Label : Modular := Hash3;
Block_Sample : Modular := Hash5;
...
Hash3 := Hash3 + 1;
Freq (Hash3):= Freq(Hash3) + 1; -- Here is where my made-on-the-fly example is focused
I think I can make the whole algorithm more efficient and still use enumeration types if I can just get the enumeration type to do its math in the processor in the same number of cycles as mod-type math. I have gotten a little creative in thinking of ways, but they were too obviously not right for me to use any of them as an example. The only thing I can think of that might be possible exceeds my skill: writing a procedure using inline ASM (gas assembly language syntax) to make the operation very direct on the processor.
PS: I know this is a minor gain on its own, but any gain is appropriate for this application.
Not sure that it’ll make much difference!
Given this
package Cyclic is
   type Enum is (A, B, C, D, E);
   type Modular is mod 5;
   function Next_Enum (En : Enum) return Enum is
     (if En = Enum'Last then Enum'First else Enum'Succ (En)) --'
     with Inline_Always;
end Cyclic;
and
with Cyclic; use Cyclic;
procedure Cyclic_Use (N : Natural; E : in out Enum; M : in out Modular) is
begin
   begin
      for J in 1 .. N loop
         E := Next_Enum (E);
      end loop;
   end;
   begin
      for J in 1 .. N loop
         M := M + 1;
      end loop;
   end;
end Cyclic_Use;
and compiling using GCC 5.2.0 with -O3 (gnatmake -O3 -c -u -f cyclic_use.adb -cargs -S), the x86_64 assembler generated for the two loops is
(enumeration)
L3:
        leal    1(%rsi), %ecx
        addl    $1, %eax
        cmpb    $4, %sil
        cmove   %r8d, %ecx
        cmpl    %eax, %edi
        movl    %ecx, %esi
        jne     L3
(modular)
L4:
        leal    -4(%rdx), %ecx
        addl    $1, %eax
        cmpb    $3, %dl
        leal    1(%rdx), %r8d
        movl    %ecx, %edx
        cmovle  %r8d, %edx
        cmpl    %eax, %edi
        jne     L4
I don’t pretend to know x86_64 assembler, and I don’t know why the enumeration version compares against 4 while the modular version compares against 3, but these look very similar to me! If anything, the enumeration version is one instruction shorter ...
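As a hedged aside in C: the reason the questioner's power-of-two modulus ("mod 64") is so cheap is that its wrap-around is a single AND after the increment, whereas a non-power-of-two wrap like the mod 5 above needs a compare plus a conditional move, which matches the shape of the generated code shown.

/* Wrap-around increment for a 64-element cycle: typically compiles to add + and. */
unsigned wrap64(unsigned x) { return (x + 1) & 63u; }

/* Wrap-around increment for a 5-element cycle: needs a compare and a
 * conditional select, much like the cmov in the listings above. */
unsigned wrap5(unsigned x) { return (x == 4u) ? 0u : x + 1u; }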