why fork and exec are kept 2 seperate calls - unix

I understand the differences between fork, vfork, exec, execv, execp. So pls dont rant about it.
My question is about the design of the unix process creation. Why did the designers think of creating 2 seperate calls ( fork and exec ) instead of keeping it one tight call ( spawn ).
Was good API design a reason so that developers had more control over process creation?
Is it because of performance reason, so that we could delay allocating process table and other kernel structures to the child till either copy-on-write or copy-on-access?

The main reason is likely that the separation of the fork() and exec() steps allows arbitrary setup of the child environment to be done using other system calls. For example, you can:
Set up an arbitrary set of open file descriptors;
Alter the signal mask;
Set the current working directory;
Set the process group and/or session;
Set the user, group and supplementary groups;
Set hard and soft resource limits;
...and many more besides. If you were to combine these calls into a single spawn() call, it would have to have a very complex interface, to be able to encode all of these possible changes to the child's environment - and if you ever added a new setting, the interface would need to be changed. On the other hand, separate fork() and exec() steps allow you to use the ordinary system calls (open(), close(), dup(), fcntl(), ...) to manipulate the child's environment prior to the exec(). New functionality (eg. capset()) is easily supported.

fork and exec do completely different things.
fork() - duplicates a process
exec() - replaces a process
There's plenty of reasons to use one without the other. You can fork off child processes that perform tasks on behalf of your controlling parent app e.g., pretty common in the unix world. And you can e.g. setup the preconditions for some other quirky application and then exec it from your launcher application without ever using fork.

From the draft book, xv6 a simple, Unix-like teaching operating system used in MIT course 6.828:
Now it should be clear why it is a good idea that fork and exec are separate calls. Because if they are separate, the shell can fork a child, use open, close, dup in the child to change the standard input and output file descriptors, and then exec. No changes to the program being exec-ed (cat in our example) are required. If fork and exec were combined into a single system call, some other (probably more complex) scheme would be required for the shell to redirect standard input and output, or the program itself would have to understand how to redirect I/O.
Basically as caf says below but providing a good reference. Implementing a simple shell would probably make the reasons very clear.

AFAIK, initially there was only fork(). But performance reasons called for creation of exec() that does not re-create the kernel structures that are going to be immediately overwritten. <-- This is wrong.
There's a performance problem when a process needs co create a child process which is not a copy of itself. A fork() copies process-related kernel data which are going to be immediately replaced by an exec(). Thus vfork() was introduced that does not copy excessive kernel data and any process data; after it, a process is expected to call something exec()-like, and the parent is suspended until the child does so. See 'Bugs' section for description of problems with vfork(), though.

fork() can only be used for creating a child process.Only replication of parent can be done.
While exec() can be used for starting any new process on the system.
I do not see any correlation above two.

Related

Can a process tell whether it was started directly by init?

On a Unix system, there are two circumstances where getppid will return 1: either the calling process' original parent has exited, or it was directly started by init. Sometimes one might wish to behave differently depending on which of the two it is; for example, the program in this question wants to exit when its parent does, and one way to detect that is to notice that the value returned by getppid is now 1, but init never exits, so if the program was directly started by init it should not exit. You can't tell by calling getppid at the very beginning of main, because your parent might have exited before you ever get a chance to run — if it did the double-fork trick to dissociate from a shell, for instance.
In days of yore, being directly started by init was only a realistic possibility for a tiny handful of processes, /etc/rc and maybe some gettys, but nowadays we have containers and much more powerful "system manager" programs run as init, so it's much more likely to come up.
So the question is, is there any 100% reliable way for a program to tell whether its original parent was init? OS-specific answers are OK, I'm almost certain there's no way to do it within POSIX.

Why does zumero_sync need to be called multiple times?

According to the documentation for zumero_sync:
If a large amount of information needs to be pulled from the server,
this function may need to be called more than once.
In my Android app that uses Zumero that's no problem; I just keep calling zumero_sync until the return value doesn't start with "0;".
However, now I'm trying to write an admin script that also syncs with my server dbfiles. I'd like to use the sqlite3 shell, and have the script pass the SQL to execute via command line arguments. I need to call zumero_sync in a loop (which SQLite doesn't support) to make sure the db is fully synced. If I had to, I could invoke sqlite3 in a loop (reading its output, looking for "0;"), or even write a C++ app to call the SQLite/Zumero functions natively. But it certainly would be easier if a single zumero_sync was enough.
I guess my real question is: could zumero_sync be changed so it completes the sync before returning? If there are cases where the existing behavior is more useful, maybe there could be a parameter for specifying which mode to use?
I see two basic questions here:
(1) Why does zumero_sync() work the way it does?
(2) Can it work differently?
I'll answer (2) first, since it's easier: Yes, it could work differently. Rather, we could (and probably will, soon, you brought this up) implement an additional function, named something like zumero_sync_complete(), which performs [the guts of] zumero_sync() in a loop and returns after the sync is complete.
We didn't implement zumero_sync_complete() because it doesn't add much value. It's a simple loop, so you can darn well write it yourself. :-)
Er, except in scripting environments which don't support loops. Like the sqlite3 shell.
Answer to (1):
The Zumero sync protocol is designed to give the server the flexibility to return partial results if it wants to do so. And for the sake of reducing load on the server (and increasing its scalability) it often does want to do exactly that.
Given that, one reason to expose this to the client is to increase the client's flexibility as well. As long we're making multiple roundtrips, we might as well give the client an opportunity to do something (like, maybe, update a progress bar) in between them.
Another thing a client might want to do in between loop iterations is handle an error.
Or, in the case of a multithreaded client, it might want to deal with changes that happened on the client while the sync is going on.
Which raises the question of how locking should be managed? Do we hold the sqlite write lock during the entire loop? Or only when absolutely necessary?
Bottom line: A robust app would probably want to implement the loop itself so that it can make its own decisions and retain full control over things.
But, as you observe, the sqlite3 shell doesn't have loops. And it's not an app. And it doesn't have threads. Or progress bars. So it's a use case where a simpler-and-less-powerful form of zumero_sync() would make sense.

Possible to fork a process outside from it?

Well, it is obvious, let's say we have two processes A and F. F wants to fork A when it has the CPU control (and A is suspended since CPU is on F).
I have Googled however nothing related showed up. Is such a thing possible in Unix environments?
There is definitely no standard and/or portable way to clone a process from the outside but depending on the OS, there are certainly possible ways to divert a process from its task and force it to clone itself or do whatever you want.
I don't think it's a good idea in any way, but it may be possible for process F to attach to A using a debugger interface such as ptrace. Doing something like suspending the target process, saving its state, diverting the process to run fork, then restoring its original state.
It should be noted that your cloning process will probably need to handle some odd cases around threads and the like.
No it's not possible.
The fork() system call makes a copy of the parent, so if you call fork() in the F process, the child will be a copy of F, there's nothing you can do to change this behavior.
The reason this is not possible is because, normally with fork(), there is exactly one difference to start out with between the two processes: the return value of the fork() call itself. Without such a call inside the code of A, there is no way for the processes to have any difference between them, so they would both be doing exactly the same thing, when normally with you want one of the processes to start doing something different.
How exactly do you think what you want to do should work?
No, this would be a huge security hole that would result in the leakage of sensitive information if it were possible.
At best, you could setup a signal handler in the parent process that would fork(2) off a child (maybe exec(2) a pre-configured child process?).
I think you would be better served by looking in to message passing between two processes that have CPU affinity setup, but even then, I think the gains would be nominal (over-optimizing a problem?).
http://www.freebsd.org/cgi/man.cgi?query=cpuset&apropos=0&sektion=0

Multitasking using setjmp, longjmp

is there a way to implement multitasking using setjmp and longjmp functions
You can indeed. There are a couple of ways to accomplish it. The difficult part is initially getting the jmpbufs which point to other stacks. Longjmp is only defined for jmpbuf arguments which were created by setjmp, so there's no way to do this without either using assembly or exploiting undefined behavior. User level threads are inherently not portable, so portability isn't a strong argument for not doing it really.
step 1
You need a place to store the contexts of different threads, so make a queue of jmpbuf stuctures for however many threads you want.
Step 2
You need to malloc a stack for each of these threads.
Step 3
You need to get some jmpbuf contexts which have stack pointers in the memory locations you just allocated. You could inspect the jmpbuf structure on your machine, find out where it stores the stack pointer. Call setjmp and then modify its contents so that the stack pointer is in one of your allocated stacks. Stacks usually grow down, so you probably want your stack pointer somewhere near the highest memory location. If you write a basic C program and use a debugger to disassemble it, and then find instructions it executes when you return from a function, you can find out what the offset ought to be. For example, with system V calling conventions on x86, you'll see that it pops %ebp (the frame pointer) and then calls ret which pops the return address off the stack. So on entry into a function, it pushes the return address and frame pointer. Each push moves the stack pointer down by 4 bytes, so you want the stack pointer to start at the high address of the allocated region, -8 bytes (as if you just called a function to get there). We will fill the 8 bytes next.
The other thing you can do is write some very small (one line) inline assembly to manipulate the stack pointer, and then call setjmp. This is actually more portable, because in many systems the pointers in a jmpbuf are mangled for security, so you can't easily modify them.
I haven't tried it, but you might be able to avoid the asm by just deliberately overflowing the stack by declaring a very large array and thus moving the stack pointer.
Step 4
You need exiting threads to return the system to some safe state. If you don't do this, and one of the threads returns, it will take the address right above your allocated stack as a return address and jump to some garbage location and likely segfault. So first you need a safe place to return to. Get this by calling setjmp in the main thread and storing the jmpbuf in a globally accessible location. Define a function which takes no arguments and just calls longjmp with the saved global jmpbuf. Get the address of that function and copy it to your allocated stacks where you left room for the return address. You can leave the frame pointer empty. Now, when a thread returns, it will go to that function which calls longjmp, and jump right back into the main thread where you called setjmp, every time.
Step 5
Right after the main thread's setjmp, you want to have some code that determines which thread to jump to next, pulling the appropriate jmpbuf off the queue and calling longjmp to go there. When there are no threads left in that queue, the program is done.
Step 6
Write a context switch function which calls setjmp and stores the current state back on the queue, and then longjmp on another jmpbuf from the queue.
Conclusion
That's the basics. As long as threads keep calling context switch, the queue keeps getting repopulated, and different threads run. When a thread returns, if there are any left to run, one is chosen by the main thread, and if none are left, the process terminates. With relatively little code you can have a pretty basic cooperative multitasking setup. There are more things you probably want to do, like implement a cleanup function to free the stack of a dead thread, etc. You can also implement preemption using signals, but that is much more difficult because setjmp doesn't save the floating point register state or the flags registers, which are necessary when the program is interrupted asynchronously.
It may be bending the rules a little, but GNU pth does this. It's possible, but you probably shouldn't try it yourself except as an academic proof-of-concept exercise, use the pth implementation if you want to do it seriously and in a remotely portable fashion -- you'll understand why when you read the pth thread creation code.
(Essentially it uses a signal handler to trick the OS into creating a fresh stack, then longjmp's out of there and keeps the stack around. It works, evidently, but it's sketchy as hell.)
In production code, if your OS supports makecontext/swapcontext, use those instead. If it supports CreateFiber/SwitchToFiber, use those instead. And be aware of the disappointing truth that one of the most compelling use of coroutines -- that is, inverting control by yielding out of event handlers called by foreign code -- is unsafe because the calling module has to be reentrant, and you generally can't prove that. This is why fibers still aren't supported in .NET...
This is a form of what is known as userspace context switching.
It's possible but error-prone, especially if you use the default implementation of setjmp and longjmp. One problem with these functions is that in many operating systems they'll only save a subset of 64-bit registers, rather than the entire context. This is often not enough, e.g. when dealing with system libraries (my experience here is with a custom implementation for amd64/windows, which worked pretty stable all things considered).
That said, if you're not trying to work with complex external codebases or event handlers, and you know what you're doing, and (especially) if you write your own version in assembler that saves more of the current context (if you're using 32-bit windows or linux this might not be necessary, if you use some versions of BSD I imagine it almost definitely is), and you debug it paying careful attention to the disassembly output, then you may be able to achieve what you want.
I did something like this for studies.
https://github.com/Kraego/STM32L476_MiniOS/blob/main/Usercode/Concurrency/scheduler.c
The context/thread switching is done by setjmp/longjmp. The difficult part was to get the allocated stack correct (see allocateStack()) this depends on your platform.
This is just a demonstration how this could work, I would never use this in production.
As was already mentioned by Sean Ogden,
longjmp() is not good for multitasking, as
it can only move the stack upward and can't
jump between different stacks. No go with that.
As mentioned by user414736, you can use getcontext/makecontext/swapcontext
functions, but the problem with those is that
they are not fully in user-space. They actually
call the sigprocmask() syscall because they switch
the signal mask as part of the context switching.
This makes swapcontext() much slower than longjmp(),
and you likely don't want the slow co-routines.
To my knowledge there is no POSIX-standard solution to
this problem, so I compiled my own from different
available sources. You can find the context-manipulating
functions extracted from libtask here:
https://github.com/dosemu2/dosemu2/tree/devel/src/base/lib/mcontext
The functions are:
getmcontext(), setmcontext(), makemcontext() and swapmcontext().
They have the similar semantic to the standard functions with similar names,
but they also mimic the setjmp() semantic in that getmcontext()
returns 1 (instead of 0) when jumped to by setmcontext().
On top of that you can use a port of libpcl, the coroutine library:
https://github.com/dosemu2/dosemu2/tree/devel/src/base/lib/libpcl
With this, it is possible to implement the fast cooperative user-space
threading. It works on linux, on i386 and x86_64 arches.

Hanging a Child Process

I am trying to test out my system and wish to emulate a condition, where the child process gets hung. For doing this, I am trying to attach the child process to GDB and putting a break on it. But things don't seem to be going as expected.
Also, in the same vein, how do I know that a spawned child process is not progressing, but is hung?
Use can use SIGSTOP to hang a child process - but that is observably different from the child process going into an infinite loop, or a bad conditional wait - still it may be close enough for testing.
To check a child process has not hung, you have it send heart-beats to the parent (you'll need some kind of communications channel for this - maybe stdin/stdout at a minimum). Then the child has hung if it fails to send a couple of heart-beats messages.
A child process will inherit any pipes created before the fork. You can use this to both "hang" your child and to let it know when to continue. You can have your child process try a blocking read on the pipe, and it will block (i.e. hang) until the parent writes something.
You could also use signals like Douglass mentions. You can let the OS do basic stop/cont or you can implement signal handlers to do something more complex (like entering an infinite loop).
Examples for both of these can be found in the Unix Programming FAQ along with a ton of additional information on process control, signal handling, pipes, etc...
You can try looking in /proc to determine if you are hung. You can read /proc/<child-pid>/stat to get a lot of low-level process information including the current state, the amount of user/kernel time the process has been scheduled, the current stack and instruction pointers, etc... Using a combination of this you can try to determine if the process is hung or not. Check out the proc(5) man page for more info on /proc/<pid>/stat.

Resources