What is the inefficiency in this cairo code using alloc_locals - starknet

The following code:
func pow4(n) -> (m : felt):
alloc_locals
local x
jmp body if n != 0
[ap] = 0; ap++
ret
body:
x = n * n
[ap] = x * x; ap++
ret
end
func main():
pow4(n=5)
ret
end
is declared inefficient in the doc because of non-continuous memory.
I ran it and could not see any hole in the memory table:
Addr Value
-----------
⋮
###
⋮
1:0 2:0
1:1 3:0
1:2 5
1:3 1:2
1:4 0:21
1:5 25
1:6 625
so I don't understand where the problem is. I do see a hole with n=0 though:
Addr Value
-----------
⋮
###
⋮
1:0 2:0
1:1 3:0
1:2 0
1:3 1:2
1:4 0:21
⋮
1:6 0
that I can fix using:
jmp body if n != 0
x = 0
ret
but it's not what's suggested:
Move the instruction alloc_locals.
or Use tempvar instead of local.

You are correct that the inefficiency is due to the memory hole, and correct that it appears only for n=0. Holes do not cause an inefficiency just by existing, but rather, their existence usually means that an equivalent code could have executed using fewer memory cells in total (in this case, 6 instead of 7, in the memory section "1:"). To make it more efficient, one should aim to remove the hole (so that the parts before and after it become continuous), whereas your suggested solution just fills it (which the prover does anyway). So your solution would still use 7 memory cells in the memory segment; and will in fact also use 2 additional cells in the program section (for the x=0 command), so it is actually less efficient than leaving the hole empty: compile and check!
The inefficiency in the original code arises from the local variable x being assigned a memory cell even in the case n=0, despite not being used. To make it more efficient we would simply not want to declare and allocate x in this case.
This can be done by moving the alloc_locals command inside "body", so that it isn't executed (and the locals aren't allocated) in the n=0 case, as in the first suggestion: this saves one memory cell in the n=0 case and doesn't affect the n!=0 case. Note that alloc_locals does not have to appear right at the beginning of the code.
The second suggestion is to make x a tempvar instead of a local (and remove alloc_locals entirely). This again means no local variable is allocated in the n=0 case, so again only 6 memory cells will be used instead of 7. In fact, because the code is so simple, using the tempvar is also more efficient than declaring it as a local, at least when used as tempvar x = n * n, as it merges the ap += 1 command (which is what alloc_locals does in this case) with the x = n * n command, rather than running them separately.
Beyond discussing the theory, you should compile each of these options and compare the lengths of the code and memory segments obtained, and see what is really the most efficient and by how much: this is always true when optimizing, you should always check the actual performance.

So following @dan-carmon's answer and for the sake of completeness, here is a summary of the possible implementations and their memory table "1" when n = 0 and n > 0:
# n = 0          n = 5
1:0  2:0         1:0  2:0
1:1  3:0         1:1  3:0
1:2  0           1:2  5
1:3  1:2         1:3  1:2
1:4  0:14        1:4  0:14
1:5  0           1:5  25
                 1:6  625
Note however that the implementation using tempvar uses 2 slots less than the others in the program table "0".
Implementations tested
func pow4_reuse_slot(n) -> (m : felt):
alloc_locals
local x
jmp body if n != 0
x = 0
ret
body:
x = n * n
[ap] = x * x; ap++
ret
end
func pow4_alloc_in_body(n) -> (m : felt):
jmp body if n != 0
[ap] = 0; ap++
ret
body:
alloc_locals
local x
x = n * n
[ap] = x * x; ap++
ret
end
func pow4_use_tempvar(n) -> (m : felt):
jmp body if n != 0
[ap] = 0; ap++
ret
body:
tempvar x = n * n
[ap] = x * x; ap++
ret
end

Related

Wrong answer from spigot algorithm

I'm coding the spigot algorithm for displaying digits of pi in Ada, but my output is wrong and I can't figure out why.
I've tried changing the ranges of my loops and different ways of outputting my data, but nothing has worked properly.
with ada.integer_text_io; use ada.integer_text_io;
with Ada.Text_IO; use Ada.Text_IO;
procedure Spigot is
n : constant Integer := 1000;
length : constant Integer := 10*n/3+1;
x,q,nines,predigit :Integer :=0;
a: array (0..length) of Integer;
begin
nines:=0;
predigit:=0;
for j in 0..length loop
a(j):=2;
end loop;
for j in 1..n loop
q:=0;
for i in reverse 1..length loop
x:=10*a(i) + q*i;
a(i):= x mod (2*i-1);
q:= x/(2*i-1);
end loop;
a(1):= q mod 10;
q:=q/10;
if q = 9 then
nines:=nines+1;
elsif q = 10 then
put(predigit+1);
for k in 0..nines loop
put("0");
end loop;
predigit:=0;
nines:=0;
else
put(predigit);
predigit:=q;
if nines/=0 then
for k in 0..nines loop
put("9");
end loop;
nines:=0;
end if;
end if;
end loop;
put(predigit);
end Spigot;
So it should be displayed as 0 3 1 4 1 5 9 2 6 5 3 5 8 9... but the output I get is 0 3 1 4 1 599 2 6 5 3 5 89... It should print only one digit at a time, and the values output for pi aren't completely correct either.
I don't know the algorithm well enough to talk about why the digits are off, but I did notice some issues:
Your array is defined with bounds 0 .. Length, which would give you 1 extra element
In your loop that does the calculation, you loop from 1..length, which is ok, but you don't adjust the variable i consistently. The array indices need to be one less than the i's used in the actual calculations (keep in mind they still have to be correctly in bounds of your array). For example
x:=10*a(i) + q*i;
needs to either be
x:=10*a(i-1) + q*i;
or
x:=10*a(i) + q*(i+1);
depending on what you decide your array bounds to be. This applies to multiple lines in your code. See this Stackoverflow thread
You assign A(1) when your array starts at 0
Your loops to print out "0" and "9" should be either 1..nines or 0 .. nines-1
When you print the digits using Integer_Text_IO.Put, you need to specify a width of 1 to get rid of the spaces
There might be more, that's all I saw.
I think you are translating this answer.
You need to be more careful of your indices and your loop ranges; for example, you’ve translated
for(int i = len; i > 0; --i) {
int x = 10 * A[i-1] + q*i;
A[i-1] = x % (2*i - 1);
q = x / (2*i - 1);
}
as
for i in reverse 1..length loop
x:=10*a(i) + q*i;
a(i):= x mod (2*i-1);
q:= x/(2*i-1);
end loop;
The loop ranges are the same. But in the second line, the C code uses A[i-1], whereas yours uses a(i); similarly in the third line.
Later, for
for (int k = 0; k < nines; ++k) {
printf("%d", 0);
}
you have
for k in 0..nines loop
put("0");
end loop;
in which the C loop runs from 0 to nines - 1, but yours runs from 0 to nines. So you put out one more 0 than you should (and later on, likewise for 9s).
Also, you should use put (predigit, width=> 0).
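For comparison, here is a minimal Python sketch of the same spigot loop with the index and loop-range fixes described in the answers above (zero-based array, a[i-1] in the inner loop, exactly nines repeated digits). The function name spigot_pi and the choice to return a string are assumptions made for illustration only; as in the C version, the output starts with the initial predigit 0.

```python
def spigot_pi(n):
    """Return the characters produced by the spigot loop as one string.

    The first character is the initial predigit (0), followed by digits of pi.
    """
    length = 10 * n // 3 + 1
    a = [2] * length                  # zero-based: valid indices are 0 .. length-1
    out = []
    nines = 0
    predigit = 0
    for _ in range(n):
        q = 0
        # i runs length .. 1, but the array cell used is a[i - 1]
        for i in range(length, 0, -1):
            x = 10 * a[i - 1] + q * i
            a[i - 1] = x % (2 * i - 1)
            q = x // (2 * i - 1)
        a[0] = q % 10
        q //= 10
        if q == 9:
            nines += 1                # hold back: a later carry may turn 9s into 0s
        elif q == 10:
            out.append(str(predigit + 1))
            out.append("0" * nines)   # exactly `nines` zeros, not nines + 1
            predigit = 0
            nines = 0
        else:
            out.append(str(predigit))
            predigit = q
            if nines:
                out.append("9" * nines)
                nines = 0
    out.append(str(predigit))
    return "".join(out)


print(spigot_pi(30))
```

Each digit is emitted as a single character here, which corresponds to using a width of 1 (or 0) with Integer_Text_IO.Put in the Ada version.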

Julia: Searching for a column in a sorted matrix

I have a matrix that is sorted like the one shown below
1 1 2 2 3
1 2 3 4 1
2 1 2 1 1
It's a bit hard for me to describe the ordering, but hopefully it's clear from the example. The rough idea is that we first sort on the first row, then the second, etc.
I would like to find a specific column in the matrix, and that column may or may not exist in it.
I tried the following code:
index = searchsortedfirst(1:total_cols, col, lt=(index,x) -> (matrix[:, index] < x))
The above code works, but it is slow. I profiled the code, and it spends a lot of time in "_get_index". I then tried the following
@views index = searchsortedfirst(1:total_cols, col, lt=(index,x) -> (matrix[:, index] < x))
As expected this helped a lot, likely due to the slices I'm taking. However, is there a better way to go about this? There still seems to be a lot of overhead, and I feel like there might be a cleaner way to write this, which would be easier to optimize.
However, I absolutely value speed over clarity.
Here is some code I wrote to compare binary vs. linear search.
using Profile
function test_search()
max_val = 20
rows = 4
matrix = rand(1:max_val, rows, 10^5)
matrix = Array{Int64,2}(sortslices(matrix, dims=2))
indices = @time @profile lin_search(matrix, rows, max_val, 10^3)
indices = @time @profile bin_search(matrix, rows, max_val, 10^3)
end
function bin_search(matrix, rows, max_val, repeats)
indices = zeros(repeats)
x = zeros(Int64, rows)
cols = size(matrix)[2]
for i = 1:repeats
x = rand(1:max_val, rows)
@inbounds @views index = searchsortedfirst(1:cols, x, lt=(index,x)->(matrix[:,index] < x))
indices[i] = index
end
return indices
end
function array_eq(matrix, index, y, rows)
for i=1:rows
@inbounds if view(matrix, i, index) != y[i]
return false
end
end
return true
end
function lin_search(matrix, rows, max_val, repeats)
indices = zeros(repeats)
x = zeros(Int64, rows)
cols = size(matrix)[2]
for i = 1:repeats
index = cols + 1
x = rand(1:max_val, rows)
for j=1:cols
if array_eq(matrix, j, x, rows)
index = j;
break
end
end
indices[i] = index
end
return indices
end
Profile.clear()
test_search()
Here is some sample output
0.041356 seconds (68.90 k allocations: 3.431 MiB)
0.070224 seconds (110.45 k allocations: 5.418 MiB)
After adding some more @inbounds, it looks like a linear search is faster than binary. Seems strange when there are 10^5 columns.
If speed is most important, why not simply use the fact that Julia allows you to write fast loops?
julia> function findcol(M, col)
@inbounds @views for c in axes(M, 2)
M[:,c] == col && return c
end
return nothing
end
findcol (generic function with 1 method)
julia> col = [2,3,2];
julia> M = [1 1 2 2 3;
1 2 3 4 1;
2 1 2 1 1];
julia> @btime findcol($M, $col)
32.854 ns (3 allocations: 144 bytes)
3
This should probably be fast enough and does not even take into account any ordering.
I discovered two issues, that when fixed result in both linear and binary searches being much faster. And the binary search becomes faster than linear.
First, there was some type instability. I changed one of the lines to
matrix::Array{Int64,2} = Array{Int64,2}(sortslices(matrix, dims=2))
This resulted in an order of magnitude speedup. Also, it turns out that using @views does not do anything in the following code
@inbounds @views index = searchsortedfirst(1:cols, x, lt=(index,x)->(matrix[:,index] < x))
I am new to Julia, but my hunch is that matrix[:,index] is copied no matter what inside the anonymous function. This would make sense, since it allows for closures.
If I write a separate non-anonymous function, then that copy goes away. Linear search didn't copy the slices, so this also really sped up the binary search.
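To make the idea of a copy-free comparison concrete, here is a minimal Python sketch of a binary search over columns that are sorted lexicographically (first row, then second row, and so on). It is not the Julia code from the answers; the name find_column, the list-of-rows representation, and the 0-based result are assumptions made for the example.

```python
def find_column(matrix, target):
    """Binary-search a matrix whose columns are in lexicographic order.

    `matrix` is a list of rows; returns the 0-based column index or None.
    """
    n_rows = len(matrix)
    n_cols = len(matrix[0]) if matrix else 0

    def column_less_than_target(j):
        # Compare column j with target element by element, without copying it.
        for i in range(n_rows):
            if matrix[i][j] != target[i]:
                return matrix[i][j] < target[i]
        return False  # equal columns are not "less than"

    lo, hi = 0, n_cols
    while lo < hi:
        mid = (lo + hi) // 2
        if column_less_than_target(mid):
            lo = mid + 1
        else:
            hi = mid
    if lo < n_cols and all(matrix[i][lo] == target[i] for i in range(n_rows)):
        return lo
    return None


M = [[1, 1, 2, 2, 3],
     [1, 2, 3, 4, 1],
     [2, 1, 2, 1, 1]]
print(find_column(M, [2, 3, 2]))   # 2 (the third column)
print(find_column(M, [2, 3, 3]))   # None
```

The comparison stops at the first differing row, so most probes touch only one or two elements, which is the same effect the separate non-anonymous comparison function achieves in the Julia version.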

how to change max recursion depth in Julia?

I was curious how quick and accurate the algorithm from Rosetta Code ( https://rosettacode.org/wiki/Ackermann_function ) could be for the parameters (4,2), but I got a StackOverflowError.
julia> using Memoize
@memoize ack3(m, n) =
m == 0 ? n + 1 :
n == 0 ? ack3(m-1, 1) :
ack3(m-1, ack3(m, n-1))
# WARNING! Next line has to calculate and print number with 19729 digits!
julia> ack3(4,2) # -> StackOverflowError
# has to be -> 2003529930406846464979072351560255750447825475569751419265016973710894059556311
# ...
# 4717124577965048175856395072895337539755822087777506072339445587895905719156733
EDIT:
Oscar Smith is right that trying ack3(4,2) is unrealistic. This is a version translated from Rosetta Code's C++:
module Ackermann
function ackermann(m::UInt, n::UInt)
function ack(m::UInt, n::BigInt)
if m == 0
return n + 1
elseif m == 1
return n + 2
elseif m == 2
return 3 + 2 * n;
elseif m == 3
return 5 + 8 * (BigInt(2) ^ n - 1)
else
if n == 0
return ack(m - 1, BigInt(1))
else
return ack(m - 1, ack(m, n - 1))
end
end
end
return ack(m, BigInt(n))
end
end
julia> import Ackermann;Ackermann.ackermann(UInt(1),UInt(1));@time(a4_2 = Ackermann.ackermann(UInt(4),UInt(2)));t = "$a4_2"; println("len = $(length(t)) first_digits=$(t[1:20]) last digits=$(t[end-20:end])")
0.000041 seconds (57 allocations: 33.344 KiB)
len = 19729 first_digits=20035299304068464649 last digits=445587895905719156733
Julia itself does not have an internal limit to the stack size, but your operating system does. The exact limits here (and how to change them) will be system dependent. On my Mac (and I assume other POSIX-y systems), I can check and change the stack size of programs that get called by my shell with ulimit:
$ ulimit -s
8192
$ julia -q
julia> f(x) = x > 0 ? f(x-1) : 0 # a simpler recursive function
f (generic function with 1 method)
julia> f(523918)
0
julia> f(523919)
ERROR: StackOverflowError:
Stacktrace:
[1] f(::Int64) at ./REPL[1]:1 (repeats 80000 times)
$ ulimit -s 16384
$ julia -q
julia> f(x) = x > 0 ? f(x-1) : 0
f (generic function with 1 method)
julia> f(1048206)
0
julia> f(1048207)
ERROR: StackOverflowError:
Stacktrace:
[1] f(::Int64) at ./REPL[1]:1 (repeats 80000 times)
I believe the exact number of recursive calls that will fit on your stack will depend upon both your system and the complexity of the function itself (that is, how much each recursive call needs to store on the stack). This is the bare minimum. I have no idea how big you'd need to make the stack limit in order to compute that Ackermann function.
Note that I doubled the stack size and it more than doubled the number of recursive calls — this is because of a constant overhead:
julia> log2(523918)
18.998981503278365
julia> 2^19 - 523918
370
julia> log2(1048206)
19.99949084151746
julia> 2^20 - 1048206
370
Just FYI, even if you change the max recursion depth, you won't get the right answer, as Julia uses 64-bit integers by default, so integer overflow will make it not work. To have any hope of getting the right answer, you will have to use BigInts. The next problem is that you probably don't want to memoize, as almost all of the computations are not repeated, and you would be computing the function for more than 10^19729 different inputs, which you really do not want to store.
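As a cross-check of the closed-form approach, here is a minimal Python sketch mirroring the edited Julia version above (closed forms for m <= 3, plain recursion only for m >= 4). Python integers are arbitrary precision, so no BigInt type is needed. The sketch compares the result numerically against 2^65536 - 3 rather than printing it, because recent Python versions limit the length of int-to-string conversions by default.

```python
def ackermann(m, n):
    """Ackermann function using closed forms for m <= 3."""
    if m == 0:
        return n + 1
    if m == 1:
        return n + 2
    if m == 2:
        return 2 * n + 3
    if m == 3:
        return 5 + 8 * (2 ** n - 1)      # equals 2**(n + 3) - 3
    if n == 0:
        return ackermann(m - 1, 1)
    return ackermann(m - 1, ackermann(m, n - 1))


value = ackermann(4, 2)
print(value == 2 ** 65536 - 3)   # True
print(value.bit_length())         # 65536
```

With the closed forms in place the recursion for (4,2) is only a few calls deep, so neither the stack limit nor memoization matters.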

Determining the big Oh for (n-1)+(n-1)

I have been trying to get my head around this particular complexity computation, but everything I read about this type of complexity tells me that it is O(2^n). Yet if I add a counter to the code and check how many times it iterates for a given n, it seems to follow the curve of 4^n instead. Maybe I just misunderstood, as I placed a count++; inside the scope.
Is this not of type big O(2^n)?
public int test(int n)
{
if (n == 0)
return 0;
else
return test(n-1) + test(n-1);
}
I would appreciate any hints or explanation on this! I am completely new to complexity calculation and this one has thrown me off track.
//Regards
int test(int n)
{
printf("%d\n", n);
if (n == 0) {
return 0;
}
else {
return test(n - 1) + test(n - 1);
}
}
With a printout at the top of the function, running test(8) and counting the number of times each n is printed yields this output, which clearly shows 2^n growth.
$ ./test | sort | uniq -c
256 0
128 1
64 2
32 3
16 4
8 5
4 6
2 7
1 8
(uniq -c counts the number of times each line occurs. 0 is printed 256 times, 1 128 times, etc.)
Perhaps you mean you got a result of O(2^(n+1)), rather than O(4^n)? If you add up all of these numbers you'll get 511, which for n=8 is 2^(n+1)-1.
If that's what you meant, then that's fine. O(2^(n+1)) = O(2⋅2^n) = O(2^n)
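The same experiment can be reproduced without a shell pipeline. This is a minimal Python sketch that mirrors test(n) and tallies how often each argument value occurs; the helper name count_calls is an assumption for the example.

```python
from collections import Counter


def count_calls(n, counts=None):
    """Mirror test(n) and record how many times each argument value is seen."""
    if counts is None:
        counts = Counter()
    counts[n] += 1
    if n > 0:
        count_calls(n - 1, counts)
        count_calls(n - 1, counts)
    return counts


counts = count_calls(8)
for value in sorted(counts):
    print(counts[value], value)
print("total calls:", sum(counts.values()))   # 511 == 2**9 - 1
```

Because a fresh Counter is created for every top-level call, the tally cannot leak from one run into the next, which is one way a hand-placed count++ can end up overcounting.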
First off: the 'else' is redundant, since the if already returns when its condition is true.
On topic: every call forks 2 further calls, which each fork 2 calls themselves, and so on. As such, for n=1 the function is called 2 times, plus the originating call. For n=2 it is called 4+2+1 times, then 8+4+2+1, and so on, i.e. 2^(n+1) - 1 calls in total. The complexity is therefore clearly O(2^n), since constant factors and lower-order terms are dropped.
I suspect your counter wasn't properly reset between calls.
Let x(n) be the total number of calls of test.
x(0) = 1
x(n) = 1 + 2 * x(n - 1)
Unrolling the recurrence gives x(n) = 2^(n+1) - 1 calls, which is O(2^n).
The complexity T(n) of this function can be easily shown to equal c + 2*T(n-1). The recurrence given by
T(0) = 0
T(n) = c + 2*T(n-1)
has as its solution c*(2^n - 1), which is O(2^n).
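The closed form can be checked numerically. The following minimal Python sketch evaluates the recurrence T(0) = 0, T(n) = c + 2*T(n-1) directly and compares it with c*(2^n - 1); the function name T and the default c = 1 are choices made for the example.

```python
def T(n, c=1):
    """Evaluate T(0) = 0, T(n) = c + 2*T(n-1) iteratively."""
    total = 0
    for _ in range(n):
        total = c + 2 * total
    return total


for n in range(6):
    print(n, T(n), 2 ** n - 1)   # the last two columns agree
```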
Now, if you take the input size of your function to be m = lg n, as might be acceptable in this scenario (the number of bits to represent n, the true input size), then n = 2^m and this is, in fact, an O(2^(2^m)) algorithm, i.e. doubly exponential in the input size.

Divide by 10 using bit shifts?

Is it possible to divide an unsigned integer by 10 by using pure bit shifts, addition, subtraction and maybe multiply? Using a processor with very limited resources and slow divide.
Editor's note: this is not actually what compilers do, and gives the wrong answer for large positive integers ending with 9, starting with div10(1073741829) = 107374183 not 107374182. It is exact for smaller inputs, though, which may be sufficient for some uses.
Compilers (including MSVC) do use fixed-point multiplicative inverses for constant divisors, but they use a different magic constant and shift on the high-half result to get an exact result for all possible inputs, matching what the C abstract machine requires. See Granlund & Montgomery's paper on the algorithm.
See Why does GCC use multiplication by a strange number in implementing integer division? for examples of the actual x86 asm gcc, clang, MSVC, ICC, and other modern compilers make.
This is a fast approximation that's inexact for large inputs
It's even faster than the exact division via multiply + right-shift that compilers use.
You can use the high half of a multiply result for divisions by small integral constants. Assume a 32-bit machine (code can be adjusted accordingly):
int32_t div10(int32_t dividend)
{
int64_t invDivisor = 0x1999999A;
return (int32_t) ((invDivisor * dividend) >> 32);
}
What's going here is that we're multiplying by a close approximation of 1/10 * 2^32 and then removing the 2^32. This approach can be adapted to different divisors and different bit widths.
This works great for the ia32 architecture, since its IMUL instruction will put the 64-bit product into edx:eax, and the edx value will be the wanted value. Viz (assuming dividend is passed in eax and quotient returned in eax)
div10 proc
mov edx,1999999Ah ; load 1/10 * 2^32
imul eax ; edx:eax = dividend / 10 * 2 ^32
mov eax,edx ; eax = dividend / 10
ret
endp
Even on a machine with a slow multiply instruction, this will be faster than a software or even hardware divide.
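The editor's note at the top of this answer can be checked with a few lines of Python, whose integers do not overflow. This is a sketch only: div10_approx models the 0x1999999A multiply above for non-negative inputs, and div10_exact_u32 uses the multiplier 0xCCCCCCCD with a total shift of 35, the constant pair commonly emitted by compilers for unsigned 32-bit division by 10. The function names are assumptions for the example.

```python
def div10_approx(n):
    """Multiply by 0x1999999A (about 2**32 / 10) and keep the high 32 bits."""
    return (n * 0x1999999A) >> 32


def div10_exact_u32(n):
    """Multiply by 0xCCCCCCCD (ceil(2**35 / 10)) and shift right by 35.

    Exact for every unsigned 32-bit n.
    """
    assert 0 <= n <= 0xFFFFFFFF
    return (n * 0xCCCCCCCD) >> 35


n = 1073741829
print(div10_approx(n), div10_exact_u32(n), n // 10)
# 107374183 107374182 107374182
```

The approximate multiplier overshoots 2^32 / 10 by 0.4, so the error accumulates to a full unit once n reaches 2^30 and ends in 9; the larger shift leaves the exact variant enough precision for the whole 32-bit range.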
Though the answers given so far match the actual question, they do not match the title. So here's a solution heavily inspired by Hacker's Delight that really uses only bit shifts.
unsigned divu10(unsigned n) {
unsigned q, r;
q = (n >> 1) + (n >> 2);
q = q + (q >> 4);
q = q + (q >> 8);
q = q + (q >> 16);
q = q >> 3;
r = n - (((q << 2) + q) << 1);
return q + (r > 9);
}
I think that this is the best solution for architectures that lack a multiply instruction.
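To see why the shift sequence works, here is a line-for-line Python port of divu10 with comments, plus a spot check against ordinary integer division. It assumes a 32-bit unsigned input, as in the C version; Python integers do not wrap, so no masking is needed for that range.

```python
def divu10(n):
    """Shift-and-add division by 10 for a 32-bit unsigned n."""
    assert 0 <= n <= 0xFFFFFFFF
    q = (n >> 1) + (n >> 2)          # q ~= 0.75 * n
    q = q + (q >> 4)                 # each step extends 0.110011..b towards 0.8 * n
    q = q + (q >> 8)
    q = q + (q >> 16)
    q = q >> 3                       # q ~= n / 10, possibly one too small
    r = n - (((q << 2) + q) << 1)    # r = n - 10 * q
    return q + (1 if r > 9 else 0)   # fix up the estimate


for n in (9, 10, 19, 20, 100, 0xFFFFFFFF):
    print(n, divu10(n), n // 10)
```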
Of course you can if you can live with some loss in precision. If you know the value range of your input values you can come up with a bit shift and a multiplication which is exact.
Some examples how you can divide by 10, 60, ... like it is described in this blog to format time the fastest way possible.
temp = (ms * 205) >> 11; // 205/2048 is nearly the same as /10
to expand Alois's answer a bit, we can expand the suggested y = (x * 205) >> 11 for a few more multiples/shifts:
y = (ms * 1) >> 3 // first error 8
y = (ms * 2) >> 4 // 8
y = (ms * 4) >> 5 // 8
y = (ms * 7) >> 6 // 19
y = (ms * 13) >> 7 // 69
y = (ms * 26) >> 8 // 69
y = (ms * 52) >> 9 // 69
y = (ms * 103) >> 10 // 179
y = (ms * 205) >> 11 // 1029
y = (ms * 410) >> 12 // 1029
y = (ms * 820) >> 13 // 1029
y = (ms * 1639) >> 14 // 2739
y = (ms * 3277) >> 15 // 16389
y = (ms * 6554) >> 16 // 16389
y = (ms * 13108) >> 17 // 16389
y = (ms * 26215) >> 18 // 43699
y = (ms * 52429) >> 19 // 262149
y = (ms * 104858) >> 20 // 262149
y = (ms * 209716) >> 21 // 262149
y = (ms * 419431) >> 22 // 699059
y = (ms * 838861) >> 23 // 4194309
y = (ms * 1677722) >> 24 // 4194309
y = (ms * 3355444) >> 25 // 4194309
y = (ms * 6710887) >> 26 // 11184819
y = (ms * 13421773) >> 27 // 67108869
each line is a single, independent, calculation, and you'll see your first "error"/incorrect result at the value shown in the comment. you're generally better off taking the smallest shift for a given error value as this will minimise the extra bits needed to store the intermediate value in the calculation, e.g. (x * 13) >> 7 is "better" than (x * 52) >> 9 as it needs two fewer bits of overhead, while both start to give wrong answers above 68.
if you want to calculate more of these, the following (Python) code can be used:
def mul_from_shift(shift):
    mid = 2**shift + 5.
    return int(round(mid / 10.))
and I did the obvious thing for calculating when this approximation starts to go wrong with:
def first_err(mul, shift):
    i = 1
    while True:
        y = (i * mul) >> shift
        if y != i // 10:
            return i
        i += 1
(note that // is Python's floor division; for the non-negative values used here that is the same as truncating towards zero)
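Putting the two helpers together reproduces the first rows of the table. This is a self-contained sketch; as an assumption for the example, mul_from_shift is written with integer arithmetic as ceil(2^shift / 10), which gives the same multipliers as the rounding version above without going through floats.

```python
def mul_from_shift(shift):
    """Multiplier for a given shift: ceil(2**shift / 10), integers only."""
    return (2 ** shift + 9) // 10


def first_err(mul, shift):
    """First positive i where (i * mul) >> shift differs from i // 10."""
    i = 1
    while True:
        if (i * mul) >> shift != i // 10:
            return i
        i += 1


for shift in range(3, 16):
    mul = mul_from_shift(shift)
    print(f"y = (ms * {mul}) >> {shift}  // {first_err(mul, shift)}")
```

The loop stops at shift 15 because first_err is a linear scan; larger shifts work the same way but take correspondingly longer.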
the reason for the "3/1" pattern in errors (i.e. 8 repeats 3 times followed by 9) seems to be due to the change in bases, i.e. log2(10) is ~3.32. if we plot the errors we get the following:
where the relative error is given by: mul_from_shift(shift) / (1<<shift) - 0.1
Considering Kuba Ober’s response, there is another one in the same vein.
It uses iterative approximation of the result, but I wouldn’t expect any surprising performances.
Let's say we have to find x where x = v / 10.
We’ll use the inverse operation v = x * 10 because it has the nice property that when x = a + b, then x * 10 = a * 10 + b * 10.
Let's use x as the variable holding the best approximation of the result so far. When the search ends, x will hold the result. We’ll set each bit b of x from the most significant to the least significant, one by one, and compare (x + b) * 10 with v. If it's smaller than or equal to v, then the bit b is set in x. To test the next bit, we simply shift b one position to the right (divide by two).
We can avoid the multiplication by 10 by holding x * 10 and b * 10 in other variables.
This yields the following algorithm to divide v by 10.
uint16_t x = 0, x10 = 0, b = 0x1000, b10 = 0xA000;
while (b != 0) {
    uint32_t t = (uint32_t)x10 + b10;  // widen: x10 + b10 can exceed 16 bits
    if (t <= v) {
        x10 = (uint16_t)t;
        x |= b;
    }
    b10 >>= 1;
    b >>= 1;
}
// x = v / 10
Edit: to get the algorithm of Kuba Ober, which avoids the need for the variable x10, we can subtract b10 from v instead. In this case x10 isn’t needed anymore. The algorithm becomes
uint16_t x = 0, b = 0x1000, b10 = 0xA000;
while (b != 0) {
    if (b10 <= v) {
        v -= b10;
        x |= b;
    }
    b10 >>= 1;
    b >>= 1;
}
// x = original v / 10, v = original v % 10
The loop may be unrolled and the different values of b and b10 may be precomputed as constants.
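The subtracting variant is easy to check exhaustively for 16-bit inputs. Here is a minimal Python model of that loop; the name divmod10_u16 and the decision to return the remainder as well are assumptions for the example.

```python
def divmod10_u16(v):
    """Bit-by-bit division of a 16-bit unsigned value by 10.

    Returns (quotient, remainder).
    """
    assert 0 <= v <= 0xFFFF
    x = 0
    b = 0x1000        # candidate quotient bit, highest first
    b10 = 0xA000      # the same bit multiplied by 10
    while b != 0:
        if b10 <= v:  # does this bit fit into what is left of v?
            v -= b10
            x |= b
        b10 >>= 1
        b >>= 1
    return x, v


print(divmod10_u16(65535))   # (6553, 5)
```

Starting at b = 0x1000 is enough because the largest possible quotient, 6553, is below 0x2000.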
On architectures that can only shift one place at a time, a series of explicit comparisons against decreasing powers of two multiplied by 10 might work better than the solution from Hacker's Delight. Assuming a 16-bit dividend:
uint16_t div10(uint16_t dividend) {
uint16_t quotient = 0;
#define div10_step(n) \
do { if (dividend >= (n*10)) { quotient += n; dividend -= n*10; } } while (0)
div10_step(0x1000);
div10_step(0x0800);
div10_step(0x0400);
div10_step(0x0200);
div10_step(0x0100);
div10_step(0x0080);
div10_step(0x0040);
div10_step(0x0020);
div10_step(0x0010);
div10_step(0x0008);
div10_step(0x0004);
div10_step(0x0002);
div10_step(0x0001);
#undef div10_step
if (dividend >= 5) ++quotient; // round the result (optional)
return quotient;
}
Well division is subtraction, so yes. Shift right by 1 (divide by 2). Now subtract 5 from the result, counting the number of times you do the subtraction until the value is less than 5. The result is number of subtractions you did. Oh, and dividing is probably going to be faster.
A hybrid strategy of shift right then divide by 5 using the normal division might get you a performance improvement if the logic in the divider doesn't already do this for you.
I've designed a new method in AVR assembly, with lsr/ror and sub/sbc only. It divides by 8, then subtracts the number divided by 64 and 128, then subtracts the 1,024th and the 2,048th, and so on. It works very reliably (including exact rounding) and quickly (370 microseconds at 1 MHz).
The source code is here for 16-bit-numbers:
http://www.avr-asm-tutorial.net/avr_en/beginner/DIV10/div10_16rd.asm
The page that comments this source code is here:
http://www.avr-asm-tutorial.net/avr_en/beginner/DIV10/DIV10.html
I hope that it helps, even though the question is ten years old.
brgs, gsc
elemakil's comments' code can be found here: https://doc.lagout.org/security/Hackers%20Delight.pdf
page 233. "Unsigned divide by 10 [and 11.]"

Resources