Does the processor use more than one stack to separate the call stack from the expression/register stack? - cpu-registers

I was reading some basic articles about memory manipulation by the processor, and I was confused as to how the processor handles what comes next.
The concept of the call stack is clear, but I was wondering whether the expression/register stack (used for calculations) is the same stack, and whether the stack holding a subroutine's (function's) local variables is also that same call stack.
If anyone could explain to me how the processor operates regarding its stack(s), that'd help me a lot.

All the processors I've worked on have just used a single stack for these.
If you think about what the processor is doing, you only need a single stack. During calculations you can use the same stack as the call stack, because when the calculation is complete the stack will be 'clean' again. The same goes for local variables: just before you leave the scope of the local variables, the stack will be clean again, allowing the call to return correctly.
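As a small C sketch of that idea (the function names here are made up for illustration): each call's locals and temporaries occupy stack space only while the call is active, and that space is released before the call returns, so the caller sees the same stack it had before.
#include <stdio.h>

/* square_plus_one is a made-up example: its local and any temporaries live in
   the frame this call pushes, and that space is released before it returns,
   so the caller finds the stack exactly as it left it. */
static int square_plus_one(int x)
{
    int sq = x * x;
    return sq + 1;
}

int main(void)
{
    /* The return address pushed by the call and the callee's locals share the
       same single stack; after the call returns, only main's frame remains. */
    printf("%d\n", square_plus_one(6));
    return 0;
}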

You can change the stack: just set the SS:SP segment and pointer registers (and save the current values first).
Procedure-call parameters and local variables live on the stack, while dynamically created objects live on the heap (addressed through data-segment registers such as DS:DI). On a procedure call, the SS:SP register pair is shifted by the right number of bytes to reserve the needed memory, and on return SS:SP is set back to its pre-call state.

Related

Is explicit stack better than recursion

We can print a linked list in reverse order with a stack as well as by using recursion. My teacher said that using an explicit stack is better, since recursion also uses a stack but has to maintain a lot of other parameters. Even if we use std::stack from the <stack> header, doesn't pulling in a library type also take time? How does using an explicit stack save time/space compared to a recursive solution?
Recursion involves the use of an implicit stack. This is implemented behind the scenes by the compiler used to compile your code.
This background stack created for you is known as the call stack. The call stack can be thought of as a stack data structure that stores information about the active subroutines of a computer program.
Each subroutine call gets a frame on the call stack, called a stack frame. When a function returns, its stack frame is popped off the call stack.
Recursion's Call Stack vs Explicit Call Stack?
Stack overflow
The fundamental difference between the two stacks is that the space allocated for a program's call stack is fixed. This means there will be a stack overflow if you're not sure about the maximum number of recursive calls to expect and there are more calls than the space allocated to the stack can handle at a given point in time.
On the other hand, if you define an explicit stack, it is implemented in heap space allocated to the program at run time. And guess what, the heap size is not fixed and can grow dynamically at run time when required, so you don't really have to worry about the explicit stack overflowing.
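To make the comparison concrete, here is a hedged C sketch of both approaches for the linked-list example from the question (the node type and function names are invented): the recursive version leans on the call stack, while the other pushes node values onto a heap-allocated array used as an explicit stack.
#include <stdio.h>
#include <stdlib.h>

struct node { int value; struct node *next; };

/* Recursive version: each pending node occupies a call-stack frame. */
static void print_reverse_recursive(const struct node *n)
{
    if (n == NULL)
        return;
    print_reverse_recursive(n->next);
    printf("%d\n", n->value);
}

/* Explicit-stack version: pending values live in a heap-allocated array, so
   the limit is available heap memory, not the fixed call-stack size.
   count is the number of nodes, passed in to keep the sketch simple. */
static void print_reverse_explicit(const struct node *n, size_t count)
{
    int *stack = malloc(count * sizeof *stack);
    size_t top = 0;
    if (stack == NULL)
        return;
    for (; n != NULL; n = n->next)
        stack[top++] = n->value;        /* push */
    while (top > 0)
        printf("%d\n", stack[--top]);   /* pop and print */
    free(stack);
}

int main(void)
{
    struct node c = { 3, NULL }, b = { 2, &c }, a = { 1, &b };
    print_reverse_recursive(&a);
    print_reverse_explicit(&a, 3);
    return 0;
}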
Space and Time
Which one will be faster for a given situation?
Iterating on an explicit stack can be faster than recursion in languages that don’t support recursion related optimizations such as tail call optimization for tail recursion.
What’s Tail recursion?
Tail recursion is a special case of recursion where the recursive function doesn't do any more computation after the recursive call, i.e. the last step of the function is the recursive call itself.
What’s Tail-call optimization (TCO)?
Tail-call optimization is where you are able to avoid allocating a new stack frame for a function because the calling function will simply return the value that it gets from the called function.
So, compilers/languages that support tail-call optimizations implement the call to the recursive function with only a single stack frame in the call stack. If your compiler/language doesn’t support this, then using an explicit stack will save you a LOT of space and time.
Python (CPython) doesn't perform tail-call optimization. The main reason given is to keep complete, clear stack traces, which makes debugging easier. Most mainstream C/C++ compilers can perform tail-call optimization, typically at higher optimization levels.
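As a hedged C illustration (function names invented): the first version below is tail recursive, so a compiler that performs tail-call optimization, such as GCC or Clang at -O2 (not guaranteed by the language standard), can turn it into a loop that reuses a single stack frame; the second is not tail recursive, because a multiplication still has to happen after the recursive call returns.
#include <stdio.h>

/* Tail recursive: the recursive call is the very last thing the function does,
   so a TCO-capable compiler can reuse the current stack frame (not guaranteed). */
static unsigned long factorial_tail(unsigned long n, unsigned long acc)
{
    if (n <= 1)
        return acc;
    return factorial_tail(n - 1, acc * n);
}

/* Not tail recursive: after the recursive call returns there is still a
   multiplication to do, so every level needs its own frame to remember n. */
static unsigned long factorial_plain(unsigned long n)
{
    if (n <= 1)
        return 1;
    return n * factorial_plain(n - 1);
}

int main(void)
{
    printf("%lu %lu\n", factorial_tail(10, 1), factorial_plain(10));
    return 0;
}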
Sometimes explicitly controlling the stack helps simplify things when multiple parameters are being used, whereas a recursive solution makes the source code a lot smaller and more maintainable.
Conclusion
In the end, there's no fixed answer. For a particular scenario, many factors need to be considered, such as scalability, code maintainability, and the language/compiler being used.
The best way is to implement the solution both ways, time the two solutions on a representative input set, and analyze peak space utilization before deploying either to production.
See Wikipedia - Recursion versus Iteration

Memory management without stack?

To model the run-time semantics of procedures, it is known that a stack is generally needed.
If the language does not allow procedure recursion, do we have to have stacks?
And if the language does allow procedure recursion, but a recursive call can only happen at the end of a procedure, do we have to have stacks?
In Fortran, which you are probably interested in as an example, you do need a stack for recursion. That is because you want the local variables of a recursive procedure to be independent for each invocation of the procedure. Not all of them have to be independent, but you generally want to have that possibility.
Without recursion, there is only one invocation of any procedure at a time, so local variables can be static. Not so with recursion: you don't know how deep it will go, so you need some dynamic data structure to store the data. You could emulate the stack on the heap if necessary, but you do need some dynamic memory.
Often, the stack is also used for automatic (variable-length) arrays, but that is not required; they can be placed on the heap, depending on the compiler and its settings.
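A hedged C sketch of the same idea (names invented): without recursion, a local can safely live in fixed static storage, much like locals in old non-recursive Fortran implementations, but once a procedure can be active several times at once, each invocation needs its own copy, which is what a stack (or some other dynamic store) provides.
#include <stdio.h>

/* Fine without recursion: only one invocation is ever live at a time, so a
   single statically allocated slot can hold the "local". */
static int scale_static(int x)
{
    static int tmp;     /* one fixed memory cell, reused by every call */
    tmp = x * 2;
    return tmp + 1;
}

/* With recursion, each live invocation needs its own copies of n and partial,
   which is exactly what a per-call stack frame gives you. */
static int sum_to(int n)
{
    int partial;
    if (n == 0)
        return 0;
    partial = sum_to(n - 1);
    return partial + n;
}

int main(void)
{
    printf("%d %d\n", scale_static(10), sum_to(5));
    return 0;
}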
The stack is used to store the return address to jump back to after a method finishes execution.
The stack is also used to allocate objects with a limited, block-scoped lifetime.
So, if your language does not permit automatic (C-style) objects (in other words, does not have local scopes) and does not permit methods, I suppose the language could omit a stack implementation completely.
I don't think recursion by itself has anything to do with the requirement for a stack.

What affects stack size in recursion?

Exactly which parts of a recursive method call contribute to stack usage: the returned object, the arguments, local variables, and so on?
I'm trying to optimize the number of levels of recursion that an Android application can perform in limited memory before running into a StackOverflowError.
Thanks in advance.
If you run out of stack space, don't optimize your stack usage. Doing that just means the same problem will come back later, with a slightly larger input set or called from somewhere else. And at some point you have reached the theoretical or practical minimum of space you can consume for the problem you're solving. Instead, convert the offending code to use a collection other than the machine stack (e.g. a heap-allocated stack or queue). Doing so sometimes results in very ugly code, but at least it won't crash.
But to answer the question: generally, all the things you name can take stack space, and temporary values take space too (so nesting expressions like crazy just to save local variables won't help). Some of these will be stored in registers, depending on the calling convention, but may have to be spilled(*) anyway. Regardless of the calling convention, this only saves you a few bytes, and everything has to be spilled across calls anyway, as the callee is usually given free rein over the registers during the call. So at the time your stack overflows, the stack is indeed crowded with the parameters, local variables, and temporaries of earlier calls. Some may be optimized away altogether or share a stack slot if they aren't needed at the same time. Ultimately this is up to the JIT compiler.
(*) Spilling: Moving a value from a register to memory (i.e., the stack) because the register is needed for something else.
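A hedged C illustration of what each recursive frame carries (function names invented; exact sizes depend on the compiler, ABI, and optimization level): every level of the first function needs room for its argument, return address, saved registers, and a 1 KiB local buffer, while the second needs only the bookkeeping, so the first exhausts the stack far sooner. The same principle applies to JVM frames on Android.
#include <stdio.h>

/* Each level's frame holds the argument, the return address, any saved
   registers/temporaries, and this 1 KiB buffer. */
static int recurse_big_frame(int n)
{
    volatile char scratch[1024];        /* volatile: the buffer cannot be elided */
    scratch[0] = (char)n;
    if (n == 0)
        return scratch[0];
    return recurse_big_frame(n - 1) + 1;
}

/* Same recursion depth, but each frame holds only the bookkeeping. */
static int recurse_small_frame(int n)
{
    if (n == 0)
        return 0;
    return recurse_small_frame(n - 1) + 1;
}

int main(void)
{
    printf("%d %d\n", recurse_big_frame(100), recurse_small_frame(100));
    return 0;
}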
Each method has two stack-frame sizes associated with it: the stack required for arguments and local variables, and the stack required for expression evaluation. The return value only counts as part of the stack required for expression evaluation. The JVM is able to verify that the method does not exceed these sizes as it executes.
Exactly how much stack is required for variables and expression evaluation is down to the bytecode compiler. For instance it is often able to share local variable slots among variables with non-overlapping lifetimes.

Multitasking using setjmp, longjmp

Is there a way to implement multitasking using the setjmp and longjmp functions?
You can indeed. There are a couple of ways to accomplish it. The difficult part is initially getting the jmpbufs to point at other stacks. longjmp is only defined for jmpbuf arguments that were created by setjmp, so there's no way to do this without either using assembly or exploiting undefined behavior. User-level threads are inherently not portable anyway, so portability isn't a strong argument against doing it.
step 1
You need a place to store the contexts of the different threads, so make a queue of jmpbuf structures for however many threads you want.
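A hedged sketch of what that might look like in C (MAX_THREADS and all names are invented for the example):
#include <setjmp.h>

#define MAX_THREADS 8

/* One saved context per thread: the jmp_buf captured by setjmp plus the base
   of the malloc'd stack so it can be freed when the thread finishes. */
struct thread_ctx {
    jmp_buf env;
    void   *stack_base;
};

/* A simple circular ready queue of runnable threads. */
static struct thread_ctx ready[MAX_THREADS];
static int head, tail, count;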
Step 2
You need to malloc a stack for each of these threads.
Step 3
You need to get some jmpbuf contexts whose stack pointers lie in the memory you just allocated. You could inspect the jmpbuf structure on your machine and find out where it stores the stack pointer, then call setjmp and modify the saved contents so that the stack pointer points into one of your allocated stacks. Stacks usually grow down, so you probably want your stack pointer somewhere near the highest memory location. If you write a basic C program, use a debugger to disassemble it, and then look at the instructions it executes when you return from a function, you can find out what the offset ought to be. For example, with the System V calling convention on x86, you'll see that it pops %ebp (the frame pointer) and then executes ret, which pops the return address off the stack. So on entry into a function, it pushes the return address and the frame pointer. Each push moves the stack pointer down by 4 bytes, so you want the stack pointer to start at the highest address of the allocated region minus 8 bytes (as if you had just called a function to get there). We will fill those 8 bytes next.
The other thing you can do is write some very small (one line) inline assembly to manipulate the stack pointer, and then call setjmp. This is actually more portable, because in many systems the pointers in a jmpbuf are mangled for security, so you can't easily modify them.
I haven't tried it, but you might be able to avoid the asm by just deliberately overflowing the stack by declaring a very large array and thus moving the stack pointer.
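Below is a hedged, deliberately non-portable sketch of the inline-assembly variant, assuming GCC on x86-64 System V (Linux): it switches %rsp to a malloc'd block, runs a function there, and longjmps back. Every name is invented, nothing here is guaranteed by the C standard, and real implementations (GNU pth, libtask) go to far greater lengths.
#include <setjmp.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define STACK_SIZE (64 * 1024)

static jmp_buf back_to_main;

/* Runs with %rsp pointing into the malloc'd block; it must never return
   normally, because the "return address" there is garbage. */
static void runs_on_new_stack(void)
{
    puts("hello from the malloc'd stack");
    longjmp(back_to_main, 1);
}

int main(void)
{
    char *stack = malloc(STACK_SIZE);
    if (stack == NULL)
        return 1;

    /* Stacks grow downward on x86-64: start near the top, 16-byte aligned. */
    uintptr_t top = ((uintptr_t)stack + STACK_SIZE) & ~(uintptr_t)0xF;

    if (setjmp(back_to_main) == 0) {
        /* Point %rsp at the new stack, then call the thread body there.
           The clobber list tells the compiler not to keep live values in
           caller-saved registers across this block. */
        __asm__ volatile("mov %0, %%rsp\n\t"
                         "call *%1\n\t"
                         :
                         : "r"(top), "r"(runs_on_new_stack)
                         : "rax", "rcx", "rdx", "rsi", "rdi",
                           "r8", "r9", "r10", "r11", "memory", "cc");
        /* Never reached: runs_on_new_stack longjmps back instead. */
    }
    puts("back on the original stack");
    free(stack);
    return 0;
}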
Step 4
You need exiting threads to return the system to some safe state. If you don't do this, and one of the threads returns, it will take the address right above your allocated stack as a return address and jump to some garbage location and likely segfault. So first you need a safe place to return to. Get this by calling setjmp in the main thread and storing the jmpbuf in a globally accessible location. Define a function which takes no arguments and just calls longjmp with the saved global jmpbuf. Get the address of that function and copy it to your allocated stacks where you left room for the return address. You can leave the frame pointer empty. Now, when a thread returns, it will go to that function which calls longjmp, and jump right back into the main thread where you called setjmp, every time.
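In C, that safe landing might look like this hedged sketch (names invented; scheduler_ctx is the jmpbuf saved by the main thread):
#include <setjmp.h>

/* Filled in by a setjmp() call in the main thread before any thread runs. */
jmp_buf scheduler_ctx;

/* The address of this function is what gets copied into each freshly
   allocated stack where room was left for a return address, so a thread
   that "returns" lands here instead of in garbage. */
void thread_exit_trampoline(void)
{
    longjmp(scheduler_ctx, 1);      /* never returns; back to the scheduler */
}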
Step 5
Right after the main thread's setjmp, you want to have some code that determines which thread to jump to next, pulling the appropriate jmpbuf off the queue and calling longjmp to go there. When there are no threads left in that queue, the program is done.
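A hedged sketch of that dispatch point (the queue helpers are hypothetical names, and scheduler_ctx is the jmpbuf from the previous step):
#include <setjmp.h>
#include <stdbool.h>

extern jmp_buf scheduler_ctx;           /* saved by the main thread            */
extern bool    run_queue_empty(void);   /* hypothetical queue helpers          */
extern jmp_buf *run_queue_pop(void);

void run_threads(void)
{
    setjmp(scheduler_ctx);              /* finished threads longjmp back here  */
    if (!run_queue_empty())
        longjmp(*run_queue_pop(), 1);   /* resume the next runnable thread     */
    /* Queue empty: fall through, all threads are done. */
}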
Step 6
Write a context switch function which calls setjmp and stores the current state back on the queue, and then longjmp on another jmpbuf from the queue.
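A hedged sketch of such a context-switch function (again with hypothetical queue helpers); the setjmp return value distinguishes "I just saved my state" from "another thread resumed me":
#include <setjmp.h>

extern void     run_queue_push(jmp_buf *ctx);   /* hypothetical helpers */
extern jmp_buf *run_queue_pop(void);

void yield(jmp_buf *self)
{
    if (setjmp(*self) == 0) {          /* 0: we just saved our own context     */
        run_queue_push(self);          /* make ourselves runnable again later  */
        longjmp(*run_queue_pop(), 1);  /* switch to whichever thread is next   */
    }
    /* Non-zero: another thread longjmp'ed back to us; simply continue. */
}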
Conclusion
That's the basics. As long as threads keep calling context switch, the queue keeps getting repopulated, and different threads run. When a thread returns, if there are any left to run, one is chosen by the main thread, and if none are left, the process terminates. With relatively little code you can have a pretty basic cooperative multitasking setup. There are more things you probably want to do, like implement a cleanup function to free the stack of a dead thread, etc. You can also implement preemption using signals, but that is much more difficult because setjmp doesn't save the floating point register state or the flags registers, which are necessary when the program is interrupted asynchronously.
It may be bending the rules a little, but GNU pth does this. It's possible, but you probably shouldn't try it yourself except as an academic proof-of-concept exercise, use the pth implementation if you want to do it seriously and in a remotely portable fashion -- you'll understand why when you read the pth thread creation code.
(Essentially it uses a signal handler to trick the OS into creating a fresh stack, then longjmp's out of there and keeps the stack around. It works, evidently, but it's sketchy as hell.)
In production code, if your OS supports makecontext/swapcontext, use those instead. If it supports CreateFiber/SwitchToFiber, use those instead. And be aware of the disappointing truth that one of the most compelling uses of coroutines, namely inverting control by yielding out of event handlers called by foreign code, is unsafe because the calling module has to be reentrant, and you generally can't prove that. This is why fibers still aren't supported in .NET...
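For reference, here is a minimal hedged sketch of the makecontext/swapcontext route on a POSIX-style system (the API is obsolescent and its availability varies by platform; names are invented for the example):
#include <stdio.h>
#include <stdlib.h>
#include <ucontext.h>

static ucontext_t main_ctx, co_ctx;

static void coroutine(void)
{
    puts("coroutine: first run");
    swapcontext(&co_ctx, &main_ctx);   /* yield back to main */
    puts("coroutine: resumed");
}                                      /* returning here resumes uc_link */

int main(void)
{
    char *stack = malloc(64 * 1024);
    if (stack == NULL)
        return 1;

    getcontext(&co_ctx);               /* initialize, then retarget the stack */
    co_ctx.uc_stack.ss_sp   = stack;
    co_ctx.uc_stack.ss_size = 64 * 1024;
    co_ctx.uc_link = &main_ctx;        /* where to go when coroutine() returns */
    makecontext(&co_ctx, coroutine, 0);

    swapcontext(&main_ctx, &co_ctx);   /* run the coroutine until it yields */
    puts("main: coroutine yielded");
    swapcontext(&main_ctx, &co_ctx);   /* resume it until it returns */
    puts("main: coroutine finished");

    free(stack);
    return 0;
}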
This is a form of what is known as userspace context switching.
It's possible but error-prone, especially if you use the default implementation of setjmp and longjmp. One problem with these functions is that on many operating systems they only save a subset of the 64-bit registers rather than the entire context. This is often not enough, e.g. when dealing with system libraries (my experience here is with a custom implementation for amd64/Windows, which worked pretty stably, all things considered).
That said, if you're not trying to work with complex external codebases or event handlers, if you know what you're doing, and (especially) if you write your own version in assembler that saves more of the current context (on 32-bit Windows or Linux this might not be necessary; on some versions of BSD I imagine it almost certainly is), and if you debug it while paying careful attention to the disassembly output, then you may be able to achieve what you want.
I did something like this for studies.
https://github.com/Kraego/STM32L476_MiniOS/blob/main/Usercode/Concurrency/scheduler.c
The context/thread switching is done by setjmp/longjmp. The difficult part was to get the allocated stack right (see allocateStack()); this depends on your platform.
This is just a demonstration how this could work, I would never use this in production.
As was already mentioned by Sean Ogden, longjmp() is not good for multitasking, as it can only move the stack upward and can't jump between different stacks. No go with that.
As mentioned by user414736, you can use the getcontext/makecontext/swapcontext functions, but the problem with those is that they are not fully in user space. They actually call the sigprocmask() syscall because they switch the signal mask as part of the context switch. This makes swapcontext() much slower than longjmp(), and you likely don't want slow co-routines.
To my knowledge there is no POSIX-standard solution to this problem, so I compiled my own from different available sources. You can find the context-manipulating functions extracted from libtask here:
https://github.com/dosemu2/dosemu2/tree/devel/src/base/lib/mcontext
The functions are getmcontext(), setmcontext(), makemcontext() and swapmcontext(). They have semantics similar to the standard functions with similar names, but they also mimic the setjmp() semantics in that getmcontext() returns 1 (instead of 0) when jumped to by setmcontext().
On top of that you can use a port of libpcl, the coroutine library:
https://github.com/dosemu2/dosemu2/tree/devel/src/base/lib/libpcl
With this, it is possible to implement fast cooperative user-space threading. It works on Linux, on the i386 and x86_64 arches.

Difference between JUMP and CALL

How is a JUMP and CALL instruction different? How does it relate to the higher level concepts such as a GOTO or a procedure call? (Am I correct in the comparison?)
This is what I think:
A JUMP or GOTO transfers control to another location, and control does not automatically return to the point from which it was transferred.
A CALL, or procedure/function call, on the other hand, returns to the point from which it was called. Because of this difference, languages typically make use of a stack, and a stack frame is pushed for each procedure called to "remember" the location to come back to. This behaviour applies to recursive procedures too. In the case of tail recursion, however, there is no need to push a stack frame for each call.
Your answers and comments will be much appreciated.
You're mostly right, if you are talking about CALL/JMP in x86 assembly or something similar. The main difference is:
JMP performs a jump to a location, without doing anything else
CALL pushes the current instruction pointer onto the stack (more precisely: the address of the instruction after the CALL), and then JMPs to the location. With a RET you can get back to where you were.
Usually, CALL is just a convenience that combines pushing the return address with a JMP. You could do something like
pushl $afterJmp
jmp location
afterJmp:
instead of a CALL.
You're exactly right about the difference between a jump and a call.
In the simple case of a single function with tail recursion, the compiler may be able to reuse the existing stack frame. However, it can become more complicated with mutually recursive functions:
void ping() { printf("ping\n"); pong(); }
void pong() { printf("pong\n"); ping(); }
Consider the case where ping() and pong() are more complex functions that take different numbers of parameters. Mark Probst's paper talks about tail recursion implementation for GCC in great detail.
I think you've got the general idea.
It depends on the architecture, but in general, at the hardware level:
A jump instruction will change the program counter to continue execution at a different part of the program.
A call instruction will push the current program location (or the current location + 1) onto the call stack and jump to another part of the program. A return instruction will then pop the location off the call stack and jump back to the original location (or the original location + 1).
So, a jump instruction is close to a GOTO, while a call instruction is close to a procedural/function call.
Also, because a call stack is used when making function calls, pushing too many return addresses onto it through deep recursion will cause a stack overflow.
When learning assembly, I find it easier to deal with RISC processors than with x86, as they tend to have fewer instructions and simpler operations.
One correction to your thoughts: it is not only with tail recursion but with tail calls in general that we don't need a new stack frame and hence can simply JMP there (provided the arguments have been set up correctly).
At the microprocessor level, a (conditional) jump first checks the condition and then transfers control to other code without returning.
A call operation is like a function call in the C language: when the called function has finished, control returns to the caller, which continues its execution.

Resources