Recursive procedure gives segfault

I'm trying to make the Factorial procedure but it's not working for some reason.
I'm sorry, but I have literally no idea what the problem is. I tried debugging with gdb but I couldn't figure it out.
push 4
call Factorial
exit
Factorial:
cmp byte [rsp + 8], 0
jz end
mov eax, [rsp + 8]
mov ebx, eax
dec ebx
imul eax, ebx
add rsp, 8
push rbx
call Factorial
end:
ret

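The problem is in the calling convention of the Factorial procedure. In your code, add rsp, 8 discards the return address and push rbx overwrites it with the next argument. Once the base case is reached, the first ret returns to the ret at end, and that second ret pops a pushed argument (here 0) as its return address, which is the segfault. The product in eax is also overwritten on every call instead of being accumulated.
Recursive procedures are easier to get right with input and output in registers instead of pushed on the stack.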
I used your algorithm for 32bit Windows and debugged in OllyDbg, but it shouldn't be difficult to modify it for 64bit Linux.
user259137 PROGRAM Format=PE,Width=32,Entry=Start
IMPORT ExitProcess,Lib=kernel32.dll
Start:
MOV EAX,4
CALL Factorial ; Returns factorial of EAX in EAX.
PUSH EAX ; Errorlevel to exit with.
JMP ExitProcess
Factorial: ; Returns factorial of EAX in EAX.
CMP EAX,1
JNA end
PUSH EBX ; Do not clobber any reg but EAX.
MOV EBX,EAX
DEC EAX
CALL Factorial
IMUL EAX,EBX
POP EBX
end:RET
ENDPROGRAM
Created with euroasm.exe user259137.asm.
Factorial can also be computed recursively at assembly time at macro-level, see this example.
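For readers more comfortable with a high-level language, the register-based routine above can be mirrored in Python. This is only a sketch: the function and variable names are chosen here to match the registers, and the 32-bit mask is an assumption standing in for the width of EAX. Note that, like the assembly, it returns 0 for an input of 0, because JNA end returns EAX unchanged for any input of 1 or below.

```python
def factorial(eax):
    """Model of the register-based routine: argument and result both travel in EAX."""
    if eax <= 1:                  # CMP EAX,1 / JNA end
        return eax                # EAX is returned unchanged (so 0 stays 0)
    ebx = eax                     # PUSH EBX / MOV EBX,EAX: each call keeps its own n
    eax = factorial(eax - 1)      # DEC EAX / CALL Factorial
    return (eax * ebx) & 0xFFFFFFFF   # IMUL EAX,EBX, truncated to 32 bits


result = factorial(4)             # 4 * 3 * 2 * 1 = 24
```

Each Python call frame plays the role of one saved EBX on the stack: the multiplication happens after the recursive call returns, which is what the original stack-based attempt was missing.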

Related

Understanding pointers in assembly language

Are we enclosing the variable or register in brackets to specify a pointer in assembly?
Example1;
MOV eax, array+4
LEA eax, [array+4]
Example2;
section .data
array DB 116,97
section .bss
variable RESB 0
section .text
global _start:
_start:
mov eax,[array]
;exit
mov eax,1
int 0x80
I am not getting any errors while compiling or running the above code. Is the address of the zero index of the array placed in the EAX register?
Example3;
INC [variable]
When compiling the above code, I am getting the "operation size not specified" error. And why can't the command be used as INC variable?
Example4;
section .data
array DB 116,97
section .bss
variable RESB 97
section .text
global _start:
_start:
mov eax,4
mov ebx,1
mov ecx,variable
mov edx,1
int 0x80
;exit
mov eax,1
int 0x80
And this code is not working.

Are we enclosing the variable or register in brackets to specify a
pointer in assembly?
Example1;
MOV eax, array+4
LEA eax, [array+4]
The brackets are like the dereference operator in C (*ptr): they refer to the value stored at the address computed inside the square brackets. In this example, both instructions do the same thing. The first moves the address of the array label plus 4 into eax. The second uses lea, which loads the effective address of its source operand: it computes array + 4 exactly as a memory access would, but stores that address in eax instead of reading memory.
Example2;
section .data
array DB 116,97
section .bss
variable RESB 0
section .text
global _start:
_start:
mov eax,[array]
;exit
mov eax,1
int 0x80
I am not getting any errors while compiling or running the above code.
Is the address of the zero index of the array placed in the eax
register?
No, not the address: the brackets load the contents stored at array. Since the destination is eax, a 32-bit register, the assembler assumes that you want the first 4 bytes at array. But there are only 2 bytes at array: 116 and 97, so the upper half of eax gets whatever follows them in memory. This is probably not what you intended. To load the first byte at array into eax, do movzx eax, BYTE [array], which will move array[0] into the LSByte of eax and zero out the higher bytes. mov al, [array] will also work, though it won't zero out the upper bytes. (To get the address itself, leave the brackets off: mov eax, array.)
Example3;
INC [variable]
When compiling the above code, I am getting the "operation size not
specified" error. And why can't the command be used as INC variable.
The error says it all. variable is just an address. When you use [], how many bytes should the instruction operate on? You need to specify a size. For example, to increment the first byte, you would do inc BYTE [variable]. However, in the previous example you reserved zero bytes at variable (RESB 0), so accessing any bytes there touches memory you haven't reserved. As for "And why can't the command be used as INC variable": as I just said, variable is just a label which translates to some address, that is, a constant. You can't change the address which variable translates to.
Example4;
section .data
array DB 116,97
section .bss
variable RESB 97
section .text
global _start:
_start:
mov eax,4
mov ebx,1
mov ecx,variable
mov edx,1
int 0x80
;exit
mov eax,1
int 0x80
And this code is not working.
It may seem not to print anything, but it actually does. Memory reserved in .bss is zero-initialized, so when you print the first byte at variable, the program writes a NUL character. Most terminals display nothing for NUL, so it looks as if nothing has been printed.
(By the way, are you certain that you know what resb does? In one example, you reserve 0 bytes, and in another, you reserve 97 bytes for no apparent reason. You might want to take another look at what resb actually does.)
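The effect in Example4 can be imitated in Python. This is a sketch under assumptions: a zero-filled bytearray stands in for the 97 bytes reserved in .bss, and an in-memory stream stands in for file descriptor 1; the names are made up for illustration.

```python
import io

bss_variable = bytearray(97)            # RESB 97: 97 zero-initialized bytes
stdout_stand_in = io.BytesIO()          # hypothetical stand-in for fd 1

# mov ecx,variable / mov edx,1 / int 0x80: write one byte starting at variable
bytes_written = stdout_stand_in.write(bss_variable[:1])

output = stdout_stand_in.getvalue()     # b"\x00": one byte was written
visible = output.decode("ascii").isprintable()   # False: NUL is not printable
```

So the write succeeds and reports one byte written; the byte is simply not one a terminal renders.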

array ; variable address
byte[array] ; value of first byte of array
word[array] ; value of first word of array
byte[array + 1] ; value of second byte of array
Think of variable names as pointers: size[name] gets the value being pointed to (similar to *name in C, where name is a pointer).
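The same distinction can be tried out on a toy memory in Python. This is a minimal sketch: the 16-byte memory, the address 4 chosen for array, and the load helper are all assumptions made for illustration, not anything the assembler provides.

```python
# Toy flat memory; ARRAY is a hypothetical address picked for illustration.
memory = bytearray(16)
ARRAY = 4
memory[ARRAY:ARRAY + 2] = bytes([116, 97])    # array DB 116,97


def load(address, size):
    """Read `size` bytes at `address` as a little-endian unsigned value."""
    return int.from_bytes(memory[address:address + size], "little")


address_of_array = ARRAY             # array            -> 4 (just a number)
first_byte = load(ARRAY, 1)          # byte[array]      -> 116
first_word = load(ARRAY, 2)          # word[array]      -> 116 + 97*256 = 24948
second_byte = load(ARRAY + 1, 1)     # byte[array + 1]  -> 97
# dword[array] also pulls in the two bytes after the array (zero in this toy
# memory, but whatever happens to follow the array in a real program).
first_dword = load(ARRAY, 4)
```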

8086 assembly program for nCr

I am currently studying microprocessors, specifically the 8086, and was recently introduced to assembly language. I am stuck on the program below, which computes nCr (combinations) through a recursive procedure. Can somebody please explain the logic used here to calculate nCr through recursion?
I don't want the explanation of the whole program but I just want a basic idea about the working of recursion for nCr which I am not able to understand.
I have solved this problem of nCr through recursion(not in assembly language as I have started learning assembly recently) but with a different type of logic, that is by implementing the factorial function recursively and then calculating the nCr by using the formula of nCr, but some other logic is used here, which I am not able to understand.
This code was posted as an example on 8086{4}beginner.com, with no comments.
DATA SEGMENT
n db 10
r db 9
ncr db 0
DATA ENDS
code segment
assume cs:code,ds:data
start: mov ax,data
mov ds,ax
mov ncr,0
mov al,n
mov bl,r
call encr
call display
mov ah,4ch
int 21h
encr proc
cmp al,bl
je ncr1
cmp bl,0
je ncr1
cmp bl,1
je ncrn
dec al
cmp bl,al
je ncrn1
push ax
push bx
call encr
pop bx
pop ax
dec bl
push ax
push bx
call encr
pop bx
pop ax
ret
ncr1: inc ncr
ret
ncrn1: inc al
ncrn: add ncr,al
ret
encr endp
display proc
push cx
mov al,ncr
mov ch,al
and al,0f0h
mov cl,04
shr al,cl
cmp al,09h
jbe next
add al,07h
next:add al,30h
mov dl,al
mov ah,02h
int 21h
mov al,ch
and al,0fh
cmp al,09h
jbe next2
add al,07h
next2:add al,30h
mov dl,al
mov ah,02h
int 21h
pop cx
ret
display endp
code ends
end start
I don't want explanation for whole of the assembly code, I am just stuck at the logic behind recursion that is, I am not able to understand what logic is used for calculation of nCr. For example I know that if base cases don't match value of al is decreased until it matches one of the base cases (value of bl is also decreased by one as the program proceeds), and the calculation of nCr is done by either incrementing it by one or by adding al to it as per the base cases.
I am not able to understand how just some addition work can calculate nCr that is I am not able to understand logic behind recursion.
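One way to read encr, offered as a sketch rather than as the original author's explanation: it appears to apply Pascal's rule, C(n, r) = C(n-1, r) + C(n-1, r-1), and lets every base case add its own value to the global ncr, so the additions accumulate the final result. The Python transcription below follows the assembly branch by branch. The names ncr_by_addition and total are made up here, the 8-bit wraparound of the ncr byte is ignored, and only 0 <= r <= n is handled.

```python
def ncr_by_addition(n, r):
    """Transcription of encr: `total` plays the role of the ncr byte."""
    total = 0

    def encr(al, bl):
        nonlocal total
        if al == bl or bl == 0:    # je ncr1: C(n, n) = C(n, 0) = 1
            total += 1
            return
        if bl == 1:                # je ncrn: C(n, 1) = n
            total += al
            return
        al -= 1                    # dec al
        if bl == al:               # je ncrn1: r = n - 1, so C(n, n-1) = n
            total += al + 1        # inc al, then add ncr,al
            return
        encr(al, bl)               # first call:  C(n-1, r)
        encr(al, bl - 1)           # dec bl, second call: C(n-1, r-1)

    encr(n, r)
    return total


answer = ncr_by_addition(10, 9)    # the values from the DATA segment
```

With n = 10 and r = 9 the r = n - 1 base case is hit immediately and 10 is added once; for other inputs the recursion fans out until every branch ends in a base case, and the sum of those base values is nCr.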

x86 Assembly pointers

I am trying to wrap my mind around pointers in Assembly.
What exactly is the difference between:
mov eax, ebx
and
mov [eax], ebx
and when should dword ptr [eax] be used?
Also when I try to do mov eax, [ebx] I get a compile error, why is this?

As has already been stated, wrapping brackets around an operand means that that operand is to be dereferenced, as if it were a pointer in C. In other words, the brackets mean that you are reading a value from (or storing a value into) that memory location, rather than reading that value directly.
So, this:
mov eax, ebx
simply copies the value in ebx into eax. In a pseudo-C notation, this would be: eax = ebx.
Whereas this:
mov eax, [ebx]
dereferences the contents of ebx and stores the pointed-to value in eax. In a pseudo-C notation, this would be: eax = *ebx.
Finally, this:
mov [eax], ebx
stores the value in ebx into the memory location pointed to by eax. Again, in pseudo-C notation: *eax = ebx.
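These three forms can be tried on a toy machine in Python. This is a sketch under assumptions: the 32-byte memory, the regs dictionary, the helpers load32 and store32, and the addresses 8 and 16 are all invented for illustration.

```python
import struct

memory = bytearray(32)               # toy address space
regs = {"eax": 0, "ebx": 0}          # toy register file


def load32(address):
    """Read a little-endian DWORD from memory."""
    return struct.unpack_from("<I", memory, address)[0]


def store32(address, value):
    """Write a little-endian DWORD to memory."""
    struct.pack_into("<I", memory, address, value & 0xFFFFFFFF)


store32(8, 1234)                     # some value living at address 8
regs["ebx"] = 8                      # ebx holds a pointer to it

regs["eax"] = regs["ebx"]            # mov eax, ebx    (eax = ebx)
copied = regs["eax"]                 # 8: the pointer itself was copied

regs["eax"] = load32(regs["ebx"])    # mov eax, [ebx]  (eax = *ebx)
loaded = regs["eax"]                 # 1234: the pointed-to value

regs["eax"] = 16                     # make eax point at address 16
store32(regs["eax"], regs["ebx"])    # mov [eax], ebx  (*eax = ebx)
stored = load32(16)                  # 8: memory at 16 now holds ebx's value
```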
The registers here could also be replaced with memory operands, such as symbolic variable names. So this:
mov eax, [myVar]
dereferences the address of the variable myVar and stores the contents of that variable in eax, like eax = myVar.
By contrast, this:
mov eax, myVar
stores the address of the variable myVar into eax, like eax = &myVar.
At least, that's how most assemblers work. Microsoft's assembler (called MASM) and the Microsoft C/C++ compiler's inline assembly are a bit different. They treat the above two instructions as equivalent, essentially ignoring the brackets around memory operands.
To get the address of a variable in MASM, you would use the OFFSET keyword:
mov eax, OFFSET myVar
However, even though MASM has this forgiving syntax and allows you to be sloppy, you shouldn't. Always include the brackets when you want to dereference a variable and get its actual value. You will never get the wrong result if you write the code explicitly using the proper syntax, and it'll make it easier for others to understand. Plus, it'll force you to get into the habit of writing the code the way that other assemblers will expect it to be written, rather than relying on MASM's "do what I mean, not what I write" crutch.
Speaking of that "do what I mean, not what I write" crutch, MASM also generally allows you to get away with omitting the operand-size specifier, since it knows the size of the variable. But again, I recommend writing it for clarity and consistency. Therefore, if myVar is an int, you would do:
mov eax, DWORD PTR [myVar] ; eax = myVar
or
mov DWORD PTR [myVar], eax ; myVar = eax
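In assemblers like NASM, which are not strongly typed and don't remember that myVar is a DWORD-sized memory location, a size specifier is required whenever no register operand fixes the size (for example, when storing an immediate). NASM spells it DWORD, without PTR.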
You don't need this at all when dereferencing register operands, since the name of the register indicates its size. al and ah are always BYTE-sized, ax is always WORD-sized, eax is always DWORD-sized, and rax is always QWORD-sized. But it doesn't hurt to include it anyway, if you like, for consistency with the way you notate memory operands.
Also when I try to do mov eax, [ebx] I get a compile error, why is this?
Um…you shouldn't. This assembles fine for me in MSVC's inline assembly. As we have already seen, it is equivalent to:
mov eax, DWORD PTR [ebx]
and means that the memory location pointed to by ebx will be dereferenced and that DWORD-sized value will be loaded into eax.
why I cant do mov a, [eax] Should that not make "a" a pointer to wherever eax is pointing?
No. This combination of operands is not allowed. As you can see from the documentation for the MOV instruction, there are essentially five possibilities (ignoring alternate encodings and segments):
mov register, register ; copy one register to another
mov register, memory ; load value from memory into register
mov memory, register ; store value from register into memory
mov register, immediate ; move immediate value (constant) into register
mov memory, immediate ; store immediate value (constant) in memory
Notice that there is no mov memory, memory, which is what you were trying.
However, you can make a point to the same location that eax points to by simply coding:
mov DWORD PTR [a], eax
Now a and eax have the same value. If eax was a pointer, then a is now a pointer to that same memory location.
If you want to set a to the value that eax is pointing to, then you will need to do:
mov eax, DWORD PTR [eax] ; eax = *eax
mov DWORD PTR [a], eax ; a = eax
Of course, this clobbers the pointer and replaces it with the dereferenced value. If you don't want to lose the pointer, then you will have to use a second "scratch" register; something like:
mov edx, DWORD PTR [eax] ; edx = *eax
mov DWORD PTR [a], edx ; a = edx
I realize this is all somewhat confusing. The mov instruction is overloaded with a large number of potential meanings in the x86 ISA. This is due to x86's roots as a CISC architecture. By contrast, modern RISC architectures do a better job of separating register-register moves, memory loads, and memory stores. x86 crams them all into a single mov instruction. It's too late to go back and fix it now; you just have to get comfortable with the syntax, and sometimes it takes a second glance.

How does interaction with computer hardware look in Lisp?

In C, it is easy to manipulate memory and hardware registers, because concepts such as "address" and "volatile" are built into the language. Consequently, most OSs are written in the C family of languages. For example, I can copy an arbitrary function to an arbitrary location in memory, then call that location as a function (assuming the hardware doesn't stop me from executing data, of course; this would work on certain microcontrollers).
#include <stdio.h>
#include <string.h>
int hello_world()
{
printf("Hello, world!");
return 0;
}
int main()
{
unsigned char buf[1000];
memcpy(buf, (const void*)hello_world, sizeof buf);
int (*x)() = (int(*)())buf;
x();
}
However, I have been reading about the Open Genera operating system for certain dedicated Lisp machines. Wikipedia says:
Genera is written completely in Lisp; even all the low-level system code is written in Lisp (device drivers, garbage collection, process scheduler, network stacks, etc.)
I am completely new to Lisp, but this seems like a difficult thing to do: Common Lisp, from what I've seen, doesn't have good abstractions for the hardware it's running on. How would Common Lisp operating systems do something basic such as compile the following trivial function, write its machine code representation to memory, then call it?
(defun hello () (format t "Hello, World!"))
Of course, Lisp can be easily implemented in itself, but in the words of Sam Hughes, "somewhere down the line, abstraction runs out and a machine has to execute an instruction."

The Lisp machine was computer hardware with a CPU, just like modern machines today, only the CPU had some special instructions that mapped better to Lisp. It was still a stack machine, and it compiled its source to CPU instructions, just as modern Common Lisp implementations do today on more general CPUs.
In the Lisp machines wikipedia page you can see how a function gets compiled:
(defun example-count (predicate list)
(let ((count 0))
(dolist (i list count)
(when (funcall predicate i)
(incf count)))))
(disassemble (compile #'example-count))
0 ENTRY: 2 REQUIRED, 0 OPTIONAL ;Creating PREDICATE and LIST
2 PUSH 0 ;Creating COUNT
3 PUSH FP|3 ;LIST
4 PUSH NIL ;Creating I
5 BRANCH 15
6 SET-TO-CDR-PUSH-CAR FP|5
7 SET-SP-TO-ADDRESS-SAVE-TOS SP|-1
10 START-CALL FP|2 ;PREDICATE
11 PUSH FP|6 ;I
12 FINISH-CALL-1-VALUE
13 BRANCH-FALSE 15
14 INCREMENT FP|4 ;COUNT
15 ENDP FP|5
16 BRANCH-FALSE 6
17 SET-SP-TO-ADDRESS SP|-2
20 RETURN-SINGLE-STACK
This is then stored in some memory place and when running this function it just jumps or calls to this. As with any assembly code the CPU gets instructed to continue running some other code when it's done running this and it may be the Lisp main loop itself (REPL).
The same code compiled with SBCL:
; Size: 203 bytes
; 02CB9181: 48C745E800000000 MOV QWORD PTR [RBP-24], 0 ; no-arg-parsing entry point
; 189: 488B4DF0 MOV RCX, [RBP-16]
; 18D: 48894DE0 MOV [RBP-32], RCX
; 191: 660F1F840000000000 NOP
; 19A: 660F1F440000 NOP
; 1A0: L0: 488B4DE0 MOV RCX, [RBP-32]
; 1A4: 8D41F9 LEA EAX, [RCX-7]
; 1A7: A80F TEST AL, 15
; 1A9: 0F8598000000 JNE L2
; 1AF: 4881F917001020 CMP RCX, 537919511
; 1B6: 750A JNE L1
; 1B8: 488B55E8 MOV RDX, [RBP-24]
; 1BC: 488BE5 MOV RSP, RBP
; 1BF: F8 CLC
; 1C0: 5D POP RBP
; 1C1: C3 RET
; 1C2: L1: 488B45E0 MOV RAX, [RBP-32]
; 1C6: 488B40F9 MOV RAX, [RAX-7]
; 1CA: 488945D8 MOV [RBP-40], RAX
; 1CE: 488B45E0 MOV RAX, [RBP-32]
; 1D2: 488B4801 MOV RCX, [RAX+1]
; 1D6: 48894DE0 MOV [RBP-32], RCX
; 1DA: 488B55F8 MOV RDX, [RBP-8]
; 1DE: 4883EC18 SUB RSP, 24
; 1E2: 48896C2408 MOV [RSP+8], RBP
; 1E7: 488D6C2408 LEA RBP, [RSP+8]
; 1EC: B902000000 MOV ECX, 2
; 1F1: FF1425B80F1020 CALL QWORD PTR [#x20100FB8] ; %COERCE-CALLABLE-TO-FUN
; 1F8: 488BC2 MOV RAX, RDX
; 1FB: 488D5C24F0 LEA RBX, [RSP-16]
; 200: 4883EC18 SUB RSP, 24
; 204: 488B55D8 MOV RDX, [RBP-40]
; 208: B902000000 MOV ECX, 2
; 20D: 48892B MOV [RBX], RBP
; 210: 488BEB MOV RBP, RBX
; 213: FF50FD CALL QWORD PTR [RAX-3]
; 216: 480F42E3 CMOVB RSP, RBX
; 21A: 4881FA17001020 CMP RDX, 537919511
; 221: 0F8479FFFFFF JEQ L0
; 227: 488B55E8 MOV RDX, [RBP-24]
; 22B: BF02000000 MOV EDI, 2
; 230: 41BBF0010020 MOV R11D, 536871408 ; GENERIC-+
; 236: 41FFD3 CALL R11
; 239: 488955E8 MOV [RBP-24], RDX
; 23D: E95EFFFFFF JMP L0
; 242: CC0A BREAK 10 ; error trap
; 244: 02 BYTE #X02
; 245: 19 BYTE #X19 ; INVALID-ARG-COUNT-ERROR
; 246: 9A BYTE #X9A ; RCX
; 247: L2: CC0A BREAK 10 ; error trap
; 249: 02 BYTE #X02
; 24A: 02 BYTE #X02 ; OBJECT-NOT-LIST-ERROR
; 24B: 9B BYTE #X9B ; RCX
NIL
Not quite as few instructions, is it? When this function runs, that machine code gets control, and it gives control back to the system when it returns, since the return address is perhaps the REPL or the next instruction, just like with compiled C.
A special thing about Lisps in general is that lexical closures need to be handled. In C, when a call is done, its local variables don't exist anymore, but a Lisp function may return or store a function that uses those variables at a later time, after they are no longer in scope. This means that in compiled code such variables need to be handled almost as inefficiently as in interpreted code, especially with an old compiler that doesn't do much optimization.
A C compiler does its fair share of translating as well; otherwise, what would be the reason for programming in C rather than in assembly? Intel x86 processors don't have support for arguments in procedure calls. That is emulated by the C compiler: the caller puts values on the stack and cleans them up afterward. Looping constructs such as for and while don't exist, only branch/jmp. Yes, in C you get more of a feel for the underlying hardware, but it really isn't the same as machine code. It only leaks more.
A Lisp implementation as an OS can have features such as low-level assembly instructions as Lisp opcodes. Compilation would then translate everything to low-level Lisp, and from there it's 1:1 to machine bytes.
An operating system with a C library and a C compiler together does the very same thing. It translates to machine code and can then run the code itself. This is how Lisp systems are meant to work too, so the only thing you need is an API to the hardware, which can be as low-level as memory-mapped I/O etc.
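The compile, store, then call cycle described here can be imitated in Python as a loose analogy. It is only an analogy: Python's compiler targets bytecode for a virtual machine rather than native instructions, and the source text and names below are invented for illustration (the function mirrors example-count from the Lisp listing).

```python
import dis

SOURCE = '''
def example_count(predicate, items):
    count = 0
    for item in items:
        if predicate(item):
            count += 1
    return count
'''

namespace = {}
code_object = compile(SOURCE, "<example>", "exec")   # translate the source text
exec(code_object, namespace)                         # run the def, storing the function
example_count = namespace["example_count"]

# The function's instructions now sit in memory as a plain bytes object.
raw_bytes = example_count.__code__.co_code
opnames = [ins.opname for ins in dis.get_instructions(example_count)]

# Calling the function hands control to those stored instructions.
even_count = example_count(lambda n: n % 2 == 0, [1, 2, 3, 4])
```

Printing opnames gives a listing in the same spirit as the disassemblies above; the exact opcodes depend on the Python version.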

Even without abstraction lisp can emit assembler. See
Movitz's network code
An ARM assembler
But it can also be used to create a thin but powerful abstraction over machine code. See Henry Baker's Comfy Compiler
Finally, check SBCL's VOPs (example); they allow you to control what assembly code is emitted, although with virtual registers, as this happens before register allocation.
You may find this post interesting, as it deals with how to emit assembly from SBCL.
Btw, even though you can write drivers and such in lisp, it is not a good idea to needlessly duplicate the effort, so even Lisp implementations in Lisp, like SBCL, have some C parts to allow interfacing with the OS.
These C header files, along with the C source and assembly files, are
then used (figure 2) to produce the sbcl executable itself. The
executable is as yet not useful; while it provides an interface to the
operating system services, and a garbage collector
Taken from section 3.2 of SBCL: a Sanely-Bootstrappable Common Lisp
I haven't checked out how Mezzano works, feel free to dig into it.

Lisp Machines have a few low-level internal functions that allow them to access memory and hardware registers directly. These are used in the guts of the operating system.

asm pointers, what am I getting wrong?

int sort(int* list)
{
__asm
{
mov esi, [list];
mov eax, dword ptr[esi + edx * 4]; store pointer to eax?
mov edi, dword ptr[esi + 4 + edx * 4]; store pointer to edi?
jmp swap;
swap:
push dword ptr[esi + edx * 4];
mov dword ptr[esi + edx * 4], edi;
pop dword ptr[esi + 4 + edx * 4];
This is a portion of my homework code, it works properly but I want to know how I can change my swap to use registers instead of dword ptrs. I initially had:
swap: (none of this works... values remain unchanged. why? =[ )
push eax; supposed to push value pointed to?
mov eax, edi; supposed to change value pointed at by eax?
pop edi; supposed to pop pushed value into edi pointer location?
but this doesn't actually swap anything; the array passed in doesn't change. How can I rewrite my code so that the swap statement looks like this? I tried putting [] around eax in the above swap statement, but that doesn't work either.

With three instructions (as Kerrek SB said) and only one register (EAX):
int exchange ()
{ int list[5] = {1,5,2,4,3};
__asm { mov edx, 0
lea esi, list
// SWAP WITH THREE INSTRUCTIONS.
mov eax, [esi + edx * 4]
xchg [esi + 4 + edx * 4], eax
mov [esi + edx * 4], eax
// NOW LIST = {5,1,2,4,3};
}
}
Or, with the array as parameter :
int exchange ( int * list )
{ __asm { mov edx, 0
mov esi, list
// SWAP WITH THREE INSTRUCTIONS.
mov eax, [esi + edx * 4]
xchg [esi + 4 + edx * 4], eax
mov [esi + edx * 4], eax
// LIST = {5,1,2,4,3};
}
}
And this is how to call it :
int list[5] = {1,5,2,4,3};
exchange( list );

From what I understand, you want to swap two adjacent double word values in an array, and you want to do this using two registers. You are loading EAX and EDI with the two values; after you swap the register values, you need to store them back into their respective array offsets in memory for the array to change. So continuing your line of code, try:
Push Eax
Mov Eax, Edi
Pop Edi
Mov dword ptr[esi + edx * 4], Eax
Mov dword ptr[esi + 4 + edx * 4], Edi
You can leave out the dword ptr type override prefix when the destination operand is an extended register, I believe it will be assumed that the source value will be the same size (double word). So this will also work:
mov eax, [esi + edx * 4]
mov edi, [esi + 4 + edx * 4]
Also, do you have to use that mode of addressing? It seems you are using indirect indexed displacement addressing.
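The key point, that registers hold copies and memory only changes on a store, can be sketched in Python. This is an illustration under assumptions: a Python list stands in for the int array, plain variables stand in for the registers, and each register is written back to the slot it was originally loaded from.

```python
array = [1, 5, 2, 4, 3]
edx = 0                              # index of the first element to swap

eax = array[edx]                     # mov eax, [esi + edx*4]
edi = array[edx + 1]                 # mov edi, [esi + 4 + edx*4]

eax, edi = edi, eax                  # push eax / mov eax, edi / pop edi
after_register_swap = list(array)    # memory is untouched: still [1, 5, 2, 4, 3]

array[edx] = eax                     # mov [esi + edx*4], eax
array[edx + 1] = edi                 # mov [esi + 4 + edx*4], edi
after_stores = list(array)           # now [5, 1, 2, 4, 3]
```

Swapping the registers and then also crossing the store destinations would undo the swap, so only one of the two should be exchanged.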

Part of the confusion might be how your function receives its inputs. If you write your whole function in asm, rather than inline with MSVC-specific syntax, then the ABI tells you that your parameters will be on the stack (for 32bit x86 code). http://www.agner.org/optimize/ has a calling-conventions doc, too, covering the various different calling conventions for x86 and x86-64.
Anyway.
xchg might seem like exactly the instruction you want for doing a swap. If you really do need to exchange the contents of two registers, it's very similar in performance to the 3 mov instructions that would otherwise be required, but without the temporary register needed. However, it's somewhat rare to actually need to swap two registers, rather than just overwrite one, or save the old value somewhere else. Also, 3 mov reg, reg will be faster on Ivy Bridge / Haswell, because they don't need an execution unit for it; they just handle it in the register rename stage (with 0 latency).
For swapping the contents of two memory locations, xchg with a memory operand is at least 25 times slower than using mov for loads/stores, due to the implicit LOCK prefix forcing the CPU to make sure all other cores get the update right away, instead of just writing to L1 cache.
What you need to do is 2 loads, and 2 stores.
The simplest form (2 loads, 2 stores, works in the general case) will be
# void swap34(int *array)
swap34:
# 32bit: mov edi, [esp+4] # [esp] has the return address
# 64bit windows: mov rdi, rcx # first function arg comes in rcx
# array pointer in rdi, assuming 64bit SystemV (Linux) ABI.
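mov eax, [rdi+8] # load the 3rd element (array[2])
mov edx, [rdi+12] # load the 4th element (array[3]); edx, unlike ebx, is call-clobbered
mov [rdi+12], eax # store the old 3rd element into the 4th slot
mov [rdi+8], edx # store the old 4th element into the 3rd slot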
ret
With more complex addressing modes (e.g. [rdi+rax*4]), you could swap list[rax] with list[rbx].
If the memory locations are adjacent, you can load both at once with a wider load and rotate to swap, e.g.
# int *array in rdi
mov rax, [rdi+4] # load 2nd and 3rd 32bit element
rol rax, 32 # rotate left by half the reg width
mov [rdi+4], rax # store back to the same place
I believe those 3 instructions will actually run faster than rol [rdi+4], 32. (rotate with memory operand and imm8 count is 4 uops on Intel Sandybridge, throughput of 1 per 2 cycles. The load/rot/store is 3 uops, and should sustain 1 per cycle. The memory-operand version uses fewer instruction bytes. It doesn't leave either value in a register, though. Usually in real code, you're going to want to do something further with one of the values.)
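The load, rotate, store trick can be checked in Python on a byte buffer. This is a sketch under assumptions: the helper name swap_adjacent is made up, the buffer is little-endian as on x86, and the struct module stands in for the 64-bit load and store.

```python
import struct


def swap_adjacent(buffer, offset):
    """Swap the two 32-bit values at `offset` using one 64-bit load and store."""
    (value,) = struct.unpack_from("<Q", buffer, offset)               # mov rax, [rdi+offset]
    rotated = ((value << 32) | (value >> 32)) & 0xFFFFFFFFFFFFFFFF    # rol rax, 32
    struct.pack_into("<Q", buffer, offset, rotated)                   # mov [rdi+offset], rax


array = bytearray(struct.pack("<5i", 1, 5, 2, 4, 3))
swap_adjacent(array, 4)                         # 2nd and 3rd elements, as in [rdi+4]
result = list(struct.unpack("<5i", array))      # [1, 2, 5, 4, 3]
```

Rotating a 64-bit value by half its width exchanges its low and high halves, which in little-endian memory are exactly the two neighbouring 32-bit elements.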
The only other way I can think of to use fewer instructions would be if you had rsi and rdi pointing at the values to be swapped. Then you could
mov eax, [rdi] ; DON'T DO THIS,
movsd ; string-move, 4B version. copies [rsi] to [rdi]
mov [rsi-4], eax ; IT'S SLOW
This would be a lot slower than 2 loads / 2 stores, and movsd increments rsi and rdi. Saving an instruction here actually makes for slower code, and code that uses MORE space in the uop cache on recent Intel designs. (A movsd without a rep prefix is never a good choice.)
Another instruction that reads from one memory location and writes to another is pop or push with a memory operand, but that only works if the stack pointer was already pointing to one of the values you wanted to swap, and you didn't care about changing the stack pointer.
Don't go messing with the stack pointer. You can in theory save the stack pointer somewhere, and use it as another GP register for a loop where you're out of registers, but only if you don't need to call anything, and nothing asynchronous can happen that might try to use the stack while you have rsp not pointing at the stack. Seriously, it's really rare for even hand-written performance-tuned asm to use the stack pointer for anything but the normal use, so really just forget I mentioned it.
