This question already has answers here:
What's the purpose of the LEA instruction?
(17 answers)
Closed 7 years ago.
Qt shows me assembler (x86) command this way:lea (%edx,%eax,1),%ecx
What does it means? I can't find description of this instruction that fits.
UPD
Thanks. So, am i correct?
0) movl $0x0,-0x14(%ebp)
1) mov -0x14(%ebp),%eax
2) inc %eax
3) mov %eax,-0x14(%ebp)
4) mov -0x14(%ebp),%eax
5) inc %eax
6) mov %eax,-0x14(%ebp)
7) mov -0x14(%ebp),%edx
8) mov -0x14(%ebp),%eax
9) lea (%edx,%eax,1),%ecx
10) mov %ecx,-0x14(%ebp)
11) inc %edx
12) mov %edx,-0x14(%ebp)
13) inc %eax
14) mov %eax,-0x14(%ebp)
Makes this("i" is "-0x14(%ebp)"):
0) i=0
1) eax=0
2) eax=1
3) i=1
4) eax=1
5) eax=2
6) i=2
7) edx=2
8) eax=2
9) ecx=2; ecx=4
10) i=4
11) edx=3
12) i=3
13) eax=3
14) i=3
LEA stands for Load Effective Address. It's used to perform an address computation, but then rather than accessing the value at that address it stores the computed address in the destination register.
In this case, it's storing the value of EDX + (EAX * 1) into ECX. This is an alternative to the two-instruction sequence
movl %edx, %ecx
addl %eax, %ecx
Related
This question already has answers here:
What registers are preserved through a linux x86-64 function call
(3 answers)
Optimizing a loop that prints a counter
(2 answers)
Why is imul used for multiplying unsigned numbers?
(2 answers)
Why is %eax zeroed before a call to printf?
(3 answers)
Closed 6 months ago.
I am trying to code a program which prints the squares between 1 and 5 in assembly. But the output is not as expected. The function calling conventions used are those of UNIX.
My code:
.intel_syntax noprefix
.data
msg: .asciz "%d: %d\n"
.text
.global main
.type main, #function
main: PUSH RBP
MOV RBP, RSP
MOV RSI, 1 # i = 0
square: MOV RAX, RSI
MUL RSI
MOV RDI, offset flat:msg
MOV RDX, RAX
CALL printf
INC RSI
CMP RSI, 5
JB square
MOV EAX, 0
POP RBP
RET
Output: 1: 1
I am trying to code a recursive Fibonacci program in x86 Assembly (Intel AT&T syntax). I get a StackOverflow error in the form of a segmentation fault. Below is my code:
# Function signature:
# int factorial(int n)
.text
.equ n, 8
.equ fibMinus1, -4
.global fibonacci
fibonacci:
# Prologue (prepare the stack frame)
push %ebp
mov %esp, %ebp
# ECX is the non-volatile register which stores n
movl n(%ebp), %ecx
# Make space on the stack frame to store fib(n - 1)
subl $4, %esp
# If n == 0:
cmpl $0, %ecx
# return 0
je retZero
# If n == 1:
cmpl $1, %ecx
# return 1
je retOne
# EDX = fibonacci(n - 1)
decl %ecx
push %ecx
call fibonacci
movl %eax, fibMinus1(%esp)
# EAX = fibonacci(n - 2)
movl n(%ebp), %ecx
subl $2, %ecx
push %ecx
call fibonacci
# fibonacci(n - 1) + fibonacci(n - 2)
addl fibMinus1(%esp), %eax
# Epilogue (cleanup the stack frame)
mov %ebp, %esp
pop %ebp
# Return fibonacci(n - 1) + fibonacci(n - 2)
ret
retZero:
movl $0, %eax
# Epilogue (cleanup the stack frame)
mov %ebp, %esp
pop %ebp
# Return the maximum possible weight of the bags :D
ret
retOne:
movl $1, %eax
# Epilogue (cleanup the stack frame)
mov %ebp, %esp
pop %ebp
# Return the maximum possible weight of the bags :D
ret
This seems strange, considering that I modify the necessary parameters when pushing them onto the stack before the function call, and after each function call, I execute the epilogue sequence to reframe the stack to its proper position such that the proper parameters are retrieved during the next call.
My code models the following Python recursive code:
I am trying to implement Knapsack's algorithm through recursion in x86 Assembly, modeling the following Java code. However, when I run my program, it seems as if the parameter taking in the capacity of the bag (the 4th parameter) is changed following a recursive call (specifically following the prologue).
The initial call is:
knapsack(weights, values, 2, 2, 0)
When I look in the debugger, initially all the parameters are taken in as correctly:
weights is the pointer to the correct array (0x5655b2f0) ($ebp + 8)
values is the pointer to the correct array (0x5655b300) ($ebp + 12)
num_items = 2 ($ebp + 16)
capacity = 2 ($ebp + 20)
cur_value = 0 ($ebp + 24)
However, in the code executed at maximizeItemUsage, I execute the following recursive call:
knapsack(weights, values, 2, 1, 0)
However, when I look at my debugger, after the following line of code in the prologue (in knapsack):
mov %esp, %ebp
I get the following data in the parameters:
0 ($ebp + 8)
2 ($ebp + 12)
1 ($ebp + 16)
0x5655b300 ($ebp + 20)
0x5655b2f0 ($ebp + 24)
This seems quite strange, considering that the prologue is to supposed to align the stack properly. Below is my code:
# Function signature:
# int knapsack(int* w, int* v, int num_items,
# int capacity, int current);
.text
# Define our macros, which store the location of the parameters and values
# relative to the stack's base pointer (EBP)
.equ weights, 8
.equ values, 12
.equ num_items, 16
.equ capacity, 20
.equ cur_value, 24
.equ do_not_use, -4
.equ use, -8
.global knapsack
knapsack:
# solves the knapsack problem
# #weights: an array containing how much each item weighs
# #values: an array containing the value of each item
# #num_items: how many items that we have
# #capacity: the maximum amount of weight that can be carried
# #cur_weight: the current weight
# #cur_value: the current value of the items in the pack
# Prologue (prepare the stack frame)
push %ebp
mov %esp, %ebp
# Make space for local variables on the stack
# 3 variables (4 bytes each)
# Default values are 0
#
# 1. the value of the bag if we DO NOT use the current item
# 2. the value of the bag if we USE the current item
# 3. the maximum value of the bag currently
sub $8, %esp
movl $0, do_not_use(%ebp)
movl $0, use(%ebp)
# Base Case: we have utilized all the items or there is no more space left
# in the bag (num_items = 0 or capacity = 0)
cmpl $0, num_items(%ebp)
jle emptyBag
cmpl $0, capacity(%ebp)
jle emptyBag
# Case 1: We do not use the current element because adding it wil surpass capacity
# weights[n - 1] > capacity
# Push the new parameters in the stack (everything stays the same, except that
# n -> n - 1 since we are no longer using the current element)
# Compute weights[n - 1] (stored in ECX register)
# Get the memory address of the values array
movl weights(%ebp), %ecx
# Move to the memory address of weights[n - 1]
push %edx
movl num_items(%ebp), %edx # num_items
decl %edx # num_items - 1
imul $4, %edx # 4(num_items - 1)
addl %edx, %ecx # m_v + 4(num_items - 1)
# Get the actual value of weights[n - 1]
movl (%ecx), %ecx # Get value at address m_w + 4(num_items - 1)
pop %edx
# If weights[n - 1] > capacity
cmpl %ecx, capacity(%ebp)
# Shift to analyzing the previous items
jl analyzePreviousItems
# Case 2: We can use the current element (in this case, find the maximum of the values
# we use the element or if we do not use the element
jmp maximizeItemUsage
# Solidify the EAX register data
movl %eax, %eax
# Epilogue (cleanup the stack frame)
mov %ebp, %esp
pop %ebp
# Return the maximum possible weight of the bags :D
ret
maximizeItemUsage:
# knapsack(weights, values, num_items - 1, capacity, current_value)
# All parameters remain the same (except that n > n - 1)
push weights(%ebp)
push values(%ebp)
# num_items - 1
movl num_items(%ebp), %edx
decl %edx
push %edx
push capacity(%ebp)
push cur_value(%ebp)
# Call knapsack(weights, values, num_items - 1, capacity, cur_value)
call knapsack
# The value if we DO NOT use the current item
movl %eax, do_not_use(%ebp)
# knapsack(weights, values, n - 1, c - weights[n]) + values[n]
# All parameters remain the same (except that n > n - 1 and c > c - weights[n])
push weights(%ebp)
push values(%ebp)
# num_items - 1
movl num_items(%ebp), %edx
decl %edx
# capacity - weights[n] (stored in the ECX register)
# Get the memory address of the values array
movl weights(%ebp), %ecx
# Move to the memory address of weights[n]
push %edx
movl num_items(%ebp), %edx # num_items
imul $4, %edx # 4(num_items)
addl %edx, %ecx # m_w + 4(num_items)
# Restore the value of the EDX register
pop %edx
# Get the actual value of weights[n]
movl (%ecx), %ecx # Get value at address m_w + 4(num_items)
# -weights[n]
neg %ecx
# capacity - weights[n]
addl capacity(%ebp), %ecx
push %ecx
push cur_value(%ebp)
# Call knapsack(weights, values, num_items - 1, capacity - weights[n], cur_value)
call knapsack
# The value if we USE the current item
movl %eax, use(%ebp)
# Compute the maximum value of knapsack(weights, values, num_items - 1, capacity, cur_value)
# and knapsack(weights, values, n - 1, c - weights[n]) + values[n]
movl use(%ebp), %ecx
cmpl %ecx, do_not_use(%ebp)
jl setUseAsMax
movl do_not_use(%ebp), %eax
# Epilogue (cleanup the stack frame)
mov %ebp, %esp
pop %ebp
ret
setUseAsMax:
movl use(%ebp), %eax
# Epilogue (cleanup the stack frame)
mov %ebp, %esp
pop %ebp
ret
analyzePreviousItems:
# Recursive Call 1: knapsack(weights, values, n - 1, c)
# All parameters remain the same (except that n -> n - 1)
push weights(%ebp)
push values(%ebp)
# We use the EDX to contain the changed parameters since it is a
# non-volatile register
movl num_items(%ebp), %edx
decl %edx
push %edx
push capacity(%ebp)
push cur_value(%ebp)
# Call knapsack(weights, values, num_items - 1, capacity, cur_value)
call knapsack
# The value if we DO NOT use the current item
movl %eax, do_not_use(%ebp)
# Epilogue (cleanup the stack frame)
mov %ebp, %esp
pop %ebp
# Return knapsack(weights, values, n - 1, c)
ret
emptyBag:
movl 0, %eax
# Epilogue (cleanup/realign the stack frame)
mov %ebp, %esp
pop %ebp
# Return 0
ret
# 3 variables (4 bytes each)
# Default values are 0
#
# 1. the value of the bag if we DO NOT use the current item
# 2. the value of the bag if we USE the current item
# 3. the maximum value of the bag currently
sub $8, %esp
movl $0, do_not_use(%ebp)
movl $0, use(%ebp)
The comments claim to reserve space for 3 variables (12 bytes) but the code only has sub $8, %esp.
push weights(%ebp)
push values(%ebp)
# num_items - 1
movl num_items(%ebp), %edx
decl %edx
push %edx
push capacity(%ebp)
push cur_value(%ebp)
# Call knapsack(weights, values, num_items - 1, capacity, cur_value)
call knapsack
In all 3 recursive calls you have pushed the arguments in the wrong order!
This is the correct way:
push cur_value(%ebp)
push capacity(%ebp)
movl num_items(%ebp), %edx
decl %edx
push %edx
push values(%ebp)
push weights(%ebp)
call knapsack
push %edx
movl num_items(%ebp), %edx # num_items
imul $4, %edx # 4(num_items)
addl %edx, %ecx # m_w + 4(num_items)
# Restore the value of the EDX register
pop %edx
In the 2nd recursive call, you have 1 argument too few! Why is there a pop %edx ?
# Case 2: We can use the current element (in this case, find the maximum of the values
# we use the element or if we do not use the element
jmp maximizeItemUsage
# Solidify the EAX register data
movl %eax, %eax
# Epilogue (cleanup the stack frame)
mov %ebp, %esp
pop %ebp
# Return the maximum possible weight of the bags :D
ret
Please note that the code below the jmp maximizeItemUsage is unreachable. It will never run.
call knapsack # -> %EAX
# The value if we DO NOT use the current item
movl %eax, do_not_use(%ebp)
# Epilogue (cleanup the stack frame)
mov %ebp, %esp
pop %ebp
# Return knapsack(weights, values, n - 1, c)
ret
In analyzePreviousItems that movl %eax, do_not_use(%ebp) instruction is silly because the concerned local variable is just about to stop existing.
And an optimization for free
call knapsack
# The value if we USE the current item
movl %eax, use(%ebp)
# Compute the maximum value of knapsack(weights, values, num_items - 1, capacity, cur_value)
# and knapsack(weights, values, n - 1, c - weights[n]) + values[n]
movl use(%ebp), %ecx
cmpl %ecx, do_not_use(%ebp)
jl setUseAsMax
movl do_not_use(%ebp), %eax
# Epilogue (cleanup the stack frame)
mov %ebp, %esp
pop %ebp
ret
setUseAsMax:
movl use(%ebp), %eax
# Epilogue (cleanup the stack frame)
mov %ebp, %esp
pop %ebp
ret
This part of your code gets much simpler if you don't reload the use variable to a different register when it is already present in %EAX.
call knapsack
movl %eax, use(%ebp) (*)
cmpl %eax, do_not_use(%ebp)
jl setUseAsMax
movl do_not_use(%ebp), %eax
setUseAsMax:
mov %ebp, %esp
pop %ebp
ret
(*) Here also you don't need to store movl %eax, use(%ebp) because the use variable is about to get terminated.
I'm trying to design an algorithm in NASM which is supposed to compute the nth Fibonacci number. I wrote some code, but, when run, it outputs only 1 and I cannot understand why. My idea was to call Fibonacci for n-1 and for n-2, where the parameter n is stored within AX register, and the result is stored at BX.
%include "io.inc"
section .text
global CMAIN
CMAIN:
mov ebp, esp; for correct debugging
xor eax, eax
xor ebx, ebx
push word 12
call fibonacci
pop word [trash]
PRINT_UDEC 1, ax
ret
fibonacci: push ebp
mov ebp, esp
mov ax, [ebp + 8];se incarca ax cu parametrul primit prin stiva
cmp ax, 0
jne l1
mov bx, 0
jmp l3
l1: cmp ax, 1
jne l2
mov bx, 1
jmp l3
l2: sub ax, 2;n-2
push ax
call fibonacci
pop dx;resotre the stack
mov [ebp - 2], bx
mov ax, [ebp + 8];se incarca bx cu parametrul prinit prin stiva
sub ax, 1;n-1
push ax
call fibonacci
pop dx;restore the stack
add bx, [ebp - 2]
l3: mov esp, ebp
pop ebp
ret
section .data
trash dw 1
In my 80x86 assembly program, I am trying to calculate the equation of
(((((2^0 + 2^1) * 2^2) + 2^3) * 2^4) + 2^5)...(2^n), where each even exponent is preceded by a multiplication and each odd exponent is preceded by a plus. I have code, but my result is continuously off from the desired result. When 5 is put in for n, the result should be 354, however I get 330.
Any and all advice will be appreciated.
.586
.model flat
include io.h
.stack 4096
.data
number dword ?
prompt byte "enter the power", 0
string byte 40 dup (?), 0
result byte 11 dup (?), 0
lbl_msg byte "answer", 0
bool dword ?
runtot dword ?
.code
_MainProc proc
input prompt, string, 40
atod string
push eax
call power
add esp, 4
dtoa result, eax
output lbl_msg, result
mov eax, 0
ret
_MainProc endp
power proc
push ebp
mov ebp, esp
push ecx
mov bool, 1 ;initial boolean value
mov eax, 1
mov runtot, 2 ;to keep a running total
mov ecx, [ebp + 8]
jecxz done
loop1:
add eax, eax ;power of 2
test bool, ecx ;test case for whether exp is odd/even
jnz oddexp ;if boolean is 1
add runtot, eax ;if boolean is 0
loop loop1
oddexp:
mov ebx, eax ;move eax to seperate register for multiplication
mov eax, runtot ;move existing total for multiplication
mul ebx ;multiplication of old eax to new eax/running total
loop loop1
done:
mov eax, runtot ;move final runtotal for print
pop ecx
pop ebp
ret
power endp
end
You're overcomplicating your code with static variables and branching.
These are powers of 2, you can (and should) just left-shift by n instead of actually constructing 2^n and using a mul instruction.
add eax,eax is the best way to multiply by 2 (aka left shift by 1), but it's not clear why you're doing that to the value in EAX at that point. It's either the multiply result (which you probably should have stored back into runtot after mul), or it's that left-shifted by 1 after an even iteration.
If you were trying to make a 2^i variable (with a strength reduction optimization to shift by 1 every iteration instead of shifting by i), then your bug is that you clobber EAX with mul, and its setup, in the oddexp block.
As Jester points out, if the first loop loop1 falls through, it will fall through into oddexp:. When you're doing loop tail duplication, make sure you consider where fall-through will go from each tail if the loop does end there.
There's also no point in having a static variable called bool which holds a 1, which you only use as an operand for test. That implies to human readers that the mask sometimes needs to change; test ecx,1 is a lot clearer as a way to check the low bit for zero / non-zero.
You also don't need static storage for runtot, just use a register (like EAX where you want the result eventually anyway). 32-bit x86 has 7 registers (not including the stack pointer).
This is how I'd do it. Untested, but I simplified a lot by unrolling by 2. Then the test for odd/even goes away because that alternating pattern is hard-coded into the loop structure.
We increment and compare/branch twice in the loop, so unrolling didn't get rid of the loop overhead, just changed one of the loop branches into an an if() break that can leave the loop from the middle.
This is not the most efficient way to write this; the increment and early-exit check in the middle of the loop could be optimized away by counting another counter down from n, and leaving the loop if there are less than 2 steps left. (Then sort it out in the epilogue)
;; UNTESTED
power proc ; fastcall calling convention: arg: ECX = unsigned int n
; clobbers: ECX, EDX
; returns: EAX
push ebx ; save a call-preserved register for scratch space
mov eax, 1 ; EAX = 2^0 running total / return value
test ecx,ecx
jz done
mov edx, ecx ; EDX = n
mov ecx, 1 ; ECX = i=1..n loop counter and shift count
loop1: ; do{ // unrolled by 2
; add 2^odd power
mov ebx, 1
shl ebx, cl ; 2^i ; xor ebx, ebx; bts ebx, ecx
add eax, ebx ; total += 2^i
inc ecx
cmp ecx, edx
jae done ; if (++i >= n) break;
; multiply by 2^even power
shl eax, cl ; total <<= i; // same as total *= (1<<i)
inc ecx ; ++i
cmp ecx, edx
jb loop1 ; }while(i<n);
done:
pop ebx
ret
I didn't check if the adding-odd-power step ever produces a carry into another bit. I think it doesn't, so it could be safe to implement it as bts eax, ecx (setting bit i). Effectively an OR instead of an ADD, but those are equivalent as long as the bit was previously cleared.
To make the asm look more like the source and avoid obscure instructions, I implemented 1<<i with shl to generate 2^i for total += 2^i, instead of a more-efficient-on-Intel xor ebx,ebx / bts ebx, ecx. (Variable-count shifts are 3 uops on Intel Sandybridge-family because of x86 flag-handling legacy baggage: flags have to be untouched if count=0). But that's worse on AMD Ryzen, where bts reg,reg is 2 uops but shl reg,cl is 1.
Update: i=3 does produce a carry when adding, so we can't OR or BTS the bit for that case. But optimizations are possible with more branching.
Using calc:
; define shiftadd_power(n) { local res=1; local i; for(i=1;i<=n;i++){ res+=1<<i; i++; if(i>n)break; res<<=i;} return res;}
shiftadd_power(n) defined
; base2(2)
; shiftadd_power(0)
1 /* 1 */
...
The first few outputs are:
n shiftadd(n) (base2)
0 1
1 11
2 1100
3 10100 ; 1100 + 1000 carries
4 101000000
5 101100000 ; 101000000 + 100000 set a bit that was previously 0
6 101100000000000
7 101100010000000 ; increasing amounts of trailing zero around the bit being flipped by ADD
Peeling the first 3 iterations would enable the BTS optimization, where you just set the bit instead of actually creating 2^n and adding.
Instead of just peeling them, we can just hard-code the starting point for i=3 for larger n, and optimize the code that figures out a return value for the n<3 case. I came up with a branchless formula for that based on right-shifting the 0b1100 bit-pattern by 3, 2, or 0.
Also note that for n>=18, the last shift count is strictly greater than half the width of the register, and the 2^i from odd i has no low bits. So only the last 1 or 2 iterations can affect the result. It boils down to either 1<<n for odd n, or 0 for even n. This simplifies to (n&1) << n.
For n=14..17, there are at most 2 bits set. Starting with result=0 and doing the last 3 or 4 iterations should be enough to get the correct total. In fact, for any n, we only need to do the last k iterations, where k is enough that the total shift count from even i is >= 32. Any bits set by earlier iterations are shifted out. (I didn't add a branch for this special case.)
;; UNTESTED
;; special cases for n<3, and for n>=18
;; enabling an optimization in the main loop (BTS instead of add)
;; funky overflow behaviour for n>31: large odd n gives 1<<(n%32) instead of 0
power_optimized proc
; fastcall calling convention: arg: ECX = unsigned int n <= 31
; clobbers: ECX, EDX
; returns: EAX
mov eax, 14h ; 0b10100 = power(3)
cmp ecx, 3
ja n_gt_3 ; goto main loop or fall through to hard-coded low n
je early_ret
;; n=0, 1, or 2 => 1, 3, 12 (0b1, 0b11, 0b1100)
mov eax, 0ch ; 0b1100 to be right-shifted by 3, 2, or 0
cmp ecx, 1 ; count=0,1,2 => CF,ZF,neither flag set
setbe cl ; count=0,1,2 => cl=1,1,0
adc cl, cl ; 3,2,0 (cl = cl+cl + (count<1) )
shr eax, cl
early_ret:
ret
large_n: ; odd n: result = 1<<n. even n: result = 0
mov eax, ecx
and eax, 1 ; n&1
shl eax, cl ; n>31 will wrap the shift count so this "fails"
ret ; if you need to return 0 for all n>31, add another check
n_gt_3:
;; eax = running total for i=3 already
cmp ecx, 18
jae large_n
mov edx, ecx ; EDX = n
mov ecx, 4 ; ECX = i=4..n loop counter and shift count
loop1: ; do{ // unrolled by 2
; multiply by 2^even power
shl eax, cl ; total <<= i; // same as total *= (1<<i)
inc edx
cmp ecx, edx
jae done ; if (++i >= n) break;
; add 2^odd power. i>3 so it won't already be set (thus no carry)
bts eax, edx ; total |= 1<<i;
inc ecx ; ++i
cmp ecx, edx
jb loop1 ; }while(i<n);
done:
ret
By using BTS to set a bit in EAX avoids needing an extra scratch register to construct 1<<i in, so we don't have to save/restore EBX. So that's a minor bonus saving.
Notice that this time the main loop is entered with i=4, which is even, instead of i=1. So I swapped the add vs. shift.
I still didn't get around to pulling the cmp/jae out of the middle of the loop. Something like lea edx, [ecx-2] instead of mov would set the loop-exit condition, but would require a check to not run the loop at all for i=4 or 5. For large-count throughput, many CPUs can sustain 1 taken + 1 not-taken branch every 2 clocks, not creating a worse bottleneck than the loop-carried dep chains (through eax and ecx). But branch-prediction will be different, and it uses more branch-order-buffer entries to record more possible roll-back / fast-recovery points.