Confusion as to how fork() and exec() work - unix

Consider the following:
Where I'm getting confused is in the step "child duplicate of parent". If you're running a process such as say, skype, if it forks, is it copying skype, then overwriting that process copy with some other program? Moreover, what if the child process has memory requirements far different from the parent process? Wouldn't assigning the same address space as the parent be a problem?
I feel like I'm thinking about this all wrong, perhaps because I'm imagining the processes to be entire programs in execution rather than some simple instruction like "copy data from X to Y".

All modern Unix implementations use virtual memory. That allows them to get away with not actually copying much when forking. Instead, their memory map contains pointers to the parent's memory until they start modifying it.
When a child process exec's a program, that program is copied into memory (if it wasn't already there) and the process's memory map is updated to point to the new program.

fork(2) is difficult to understand. It is explained a lot, read also fork (system call) wikipage and several chapters of Advanced Linux Programming. Notice that fork does not copy the running program (i.e. the /usr/bin/skype ELF executable file), but it is lazily copying (using copy-on-write techniques - by configuring the MMU) the address space (in virtual memory) of the forking process. Each process has its address space (but might share some segments with some other processes, see mmap(2) and execve(2) ....). Since each process has its own address space, changes in the address space of one process does not (usually) affect the parent process. However, processes may have shared memory but then need to synchronize: see shm_overview(7) & sem_overview(7)...
By definition of fork, just after the fork syscall the parent and child processes have nearly equal state (in particular the address space of the child is a copy of the address space of the parent). The only difference being the return value of fork.
And execve is overwriting the address space and registers of the current process.
Notice that on Linux all processes (with a few exceptions, like kernel started processes such as /sbin/modprobe etc) are obtained by fork-ing -from the initial /sbin/init process of pid 1.
At last, system calls -listed in syscalls(2)- like fork are an elementary operation from the application's point of view, since the real processing is done inside the Linux kernel. Play with strace(1). See also this answer and that one.
A process is often some machine state (registers) + its address space + some kernel state (e.g. file descriptors), etc... (but read about zombie processes).
Take time to follow all the links I gave you.

Related

Why does process creation using `clone` result in an out-of-memory failure?

I have a process that allocates about 20GB of RAM on a 32GB machine. After some events, I'm streaming the data from the parent process to stdin of the child process. It's mandatory to keep the 20GB of data in the parent process at the point when the child is spawned.
The app is written in Rust and I'm calling Command::new('path/to/command') to create the child process.
When I spawn the child process the operating system is trapping an out-of-memory error.
strace output:
[pid 747] 16:04:41.128377 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7ff4c7f87b10) = -1 ENOMEM (Cannot allocate memory)
Why does the trap occur? The child process should not consume more than 1GB and exec() is called immediately after clone().
The Problem
When a child process is created by the Rust call, several things happen at a C/C++ level. This is a simplification, but it will help explain the dilemma.
The streams are duplicated (with dup2 or a similar call)
The parent process is forked (with the fork or clone system call)
The forked process executes the child (with call from the execvp family)
The parent and child are now concurrent processes. The Rust call you are currently using appears to be a clone call that is behaving much like a pure fork, so you're 20G x 2 - 32G = 8G short, without considering the space needed by the operating system and anything else that might be running. The clone call is returning with a negative return value and errno is set by the call to ENOMEM errno.
If the architectural solutions of adding physical memory, compressing the data, or streaming it through a process that does not require the entirety of it to be in memory at any one time are not options, then the classic solution is reasonably simple.
Recommendation
Design the parent process to be lean. Then spawn two worker children, one that handles your 20GB need and the other that handles your 1 GB need1. These children can be connected to one another via pipe, file, shared memory, socket, semaphore, signalling, and/or other communication mechanism(s), just as a parent and child can be.
Many mature software packages from Apache httpd to embedded cell tower routing daemons use this design pattern. It is reliable, maintainable, extensible, and portable.
The 32G would then likely suffice for the 20G and 1G processing needs, along with OS and lean parent process.
Although this solution will surely solve your problem, if the code is to be reused or extended later, there may be value in looking into potential process design changes involving data frames or multidimensional slices to support streaming of data and memory requirement reductions.
Memory Overcommit Always
Setting overcommit_memory to 1 eliminates the clone error condition referenced in the question because the Rust call calls the LINUX clone call that reads that setting. But there are several caveats with this solution that point back to the above recommendation as superior, primarily that the value of 1 is dangerous, especially for production environments.
Background
Kernel discussions about OpenBSD rfork and the clone call ensued in the late 1990s and early 2000s. The features stemming from those discussions permit less extreme forking than processes, which is symmetrically like the provision of more extensive independence between pthreads. Some of these discussions have produced extensions to the traditional process spawning that have entered POSIX standardization.
In the early 2000s, Linux Torvalds suggested a flag structure to determine what components of the execution model are shared and what are copied when execution forks, blurring the distinction between processes and threads. From this, the clone call emerged.
Over-committing memory is not discussed much if any in those threads. The design goal was MORE control of the results of a fork rather than the delegation of memory usage optimization to an operating system heuristic, which is what the default setting of overcommit_memory = 0 does.
Caveats
Memory overcommit goes beyond these extensions, adding the complexity of trade-offs of its modes2, design trend caveats3, practical run time limitations4, and performance impacts5.
Portability and Longevity
Additionally, without standardization, the code using memory overcommit may not be portable, and the question of longevity is pertinent, especially when a setting controls the behavior of a function. There is no guarantee of backward compatibility or even some warning of deprication if the setting system changes.
Danger
The linuxdevcenter documentation2 says, "1 always overcommits. Perhaps you now realize the danger of this mode.", and there are other indications of danger with ALWAYS overcommitting 6, 7.
The implementers of overcommit on LINUX, Windows, and VMWare may guarantee reliability, but it is a statistical game that, combined with the many other complexities of process control, may lead to certain unstable characteristics under certain conditions. Even the name overcommit tells us something about its true character as a practice.
A non-default overcommit_memory mode, for which several warnings are issues, but works for the immediate trial of the immediate case may later lead to intermittent reliability.
Predictability and Its Impact on System Reliability and Response Time Consistency
The idea of a process in a UNIX like operating system, from its Bell Labs beginnings, is that a process makes a concrete requests to its container, the operating system. The result both predictable and binary. Either the request is denied or granted. Once granted, the process is given complete control and direct access over the resources until the use of it is relinquished by the process.
The swap space aspect of virtual memory is a breach of this principle that appears as gross deceleration of activity on workstations, when RAM is heavily consumed. For instance, there are times during development when one presses a key and has to wait ten seconds to see the character on the display.
Conclusion
There are many ways to get the most out of physical memory, but doing so by hoping that use of memory allocated will be sparse will likely introduce negative impacts. Performance hits from swapping when overcommit is overused is the well documented example. If you are keeping 20G of data in RAM, this may particularly be the case.
Only allocating what is needed, forking in intelligent ways, using threads, and freeing memory that is surely no longer needed lead to memory thrift without impacting reliability, creating spikes in swap disk usage, and can operate without caveat up to the limits of system resources.
The position of the designer of the Command::new call may be based on this perspective. In this case, how soon after the fork the exec is called is not a determining factor in how much memory is requested during the spawn.
Notes and References
[1] Spawning worker children may require some code refactoring and appear to be too much trouble on a superficial level, but the refactoring may be surprisingly straightforward and significantly beneficial.
[2] http://www.linuxdevcenter.com/pub/a/linux/2006/11/30/linux-out-of-memory.html?page=2
[3] https://www.etalabs.net/overcommit.html
[4] http://www.gabesvirtualworld.com/memory-overcommit-in-production-yes-yes-yes/
[5] https://labs.vmware.com/vmtj/memory-overcommitment-in-the-esx-server
[6] https://github.com/kubernetes/kubernetes/issues/14452
[7] http://linuxtoolkit.blogspot.com/2011_08_01_archive.html

Why does OS need Fork()?

I am learning by myself introduction of OS, I have the following two questions:
(1) Since Fork() system call is used to duplicate the current process for the sake of multitasking, I'd like to see a example that shows without forking, we will not have such multitasking? In other word, I would like to see an example (or an external link) that shows how important Fork() is.
(2) DoesZombie process exist because of the child's process crash?
Thank you very much
There is no need for fork, fork is just the Unix way of creating process. Older and different systems used something different as spawn (Vax/VMS) for example.
Zombies are just traces of died processes, this is useful for parents to be aware of died children after having been busy. Remember that in Unix, a process that terminates let its parents be aware of the cause of its termination. So there is a need to let died process store that information somewhere, Unix way is to maintain a process entry named zombie as that process is reduced to this small entry and no other resources are linked to.

How to see the process table in unix?

What's the UNIX command to see the processes table, remember that table contains:
process status
pointers
process size
user ids
process ids
event descriptors
priority
etc
The "process table" as such lives in the kernel's memory. Some systems (such as AIX, Solaris and Linux--which is not "unix") have a /proc filesystem which makes those tables visible to ordinary programs. Without that, programs such as ps (on very old systems such as SunOS 4) required elevated privileges to read the /dev/kmem (kernel memory) special device, as well as having detailed knowledge about the kernel memory layout.
Your question is open ended, and an answer to a specific question you may have had can be looked up in any man page as #Alfasin suggests in his answer. A lot depends on what you are trying to do.
As #ThomasDickey points out in his response, in UNIX and most of its' derivatives, the command for viewing processes being run in the background or foreground is in fact the ps command.
ps stands for 'process status', answering your first bullet item. But the command uses over 30 options and depending on what information you seek, and permissions granted to you by the systems administrator, you can get various types of information from the command.
For example, for the second bullet item on your list above, depending on what you are looking for, you can get information on 3 different types of pointers - the session pointer (with option 'sess'), the terminal session pointer (tsess), and the process pointer (uprocp).
The rest of your items that you have listed are mostly available as standard output of the command.
Some UNIX variants implement a view of the system process table inside of the file system to support the running of programs such as ps. This is normally mounted on /proc (see #ThomasDickey response above)
Typical reasons for understanding the working of the command include system-administration responsibilities such as tracking the origin of the initiated processes, killing runaway or orphaned processes, examining the file size of the process and setting limits where necessary, etc. UNIX developers can also use it in conjunction with ipc features, etc. An understanding of the process table and status will help with associated UNIX features such as the kvm interface to examine crash dump, etc. or to get or set the kernal state.
Hope this helps

text section in process memory map

Normally, the process memory map consists of stack, text, data+bss and heap.
The memory address is independent to other processes except text section.
My question is about in text section, is there only child process could share
the same text section with its parent process? or other processes could share it too.
======================================================================
#avd: yes, refer to the wikipedia
http://en.wikipedia.org/wiki/Process_isolation
"Process isolation can be implemented by with virtual address space, where process A's address space is different from process B's address space - preventing A to write into B."
This is what I mean to each process has its own memory map.
However, when I read the OS book, it mentions that the text section could be shared. So I am not very clear with this or probably I misunderstood that part of the book.
======================================================================
Extra information:
http://www.hep.wisc.edu/~pinghc/Process_Memory.htm
Processes share the text segment if a second copy of the program is to be executed concurrently. In this setting, the system references the previously loaded text segment with the pointer rather than reloading a duplicated. If needed, shared text, which is the default when using the C/C++ compiler, can be turned off by using the -N option on the compile time.
Every process has it's very own virtual addresses. That virtual address is not shared with anybody including child process. But these virtual addresses are translated or, in other words, mapped to physical addresses by OS kernel and MMU.
The thing is that virtual addresses from different address spaces can point to the same physical addresses! For example, when process forked, it gets its own virtual address space, but unless this child process is not changing (writing) to it's memory it shares memory with parent process for reading. When child process will try to modify some memory kernel will create separate own copy of particular page for child process and it will not be shared anymore (until child process forked itself). This is known as Copy on Write (CoW).
So the real thing is that text section could be shared by mapping different virtual pages to the same physical pages (called frames).

How to monitor child process output and exit status without adversely affecting child?

I want my program to monitor some processes that it has started. These are the most important requirements I know of:
Record the exit status of the child processes (unless they exit after my program has exited).
Record the stderr and stdout output. Ideally within a few seconds of its being written, but that is not a hard requirement: it would probably be enough to read it only when a user requests it.
Sometimes, the child processes will outlive my program. Other times, they will not. It's important that my program doesn't make the child processes more likely to exit in a way that might inconvenience my users -- for example, Unix signals sent to my program shouldn't kill the child processes as a side-effect. If the parent exits, the children should continue running unaffected.
Ideally, the parent would track forks of the children so they can be monitored and perhaps signalled. This is not a hard requirement, though.
The scheme needs to work on both Linux and OS X.
My resolution is to do all of the standard operations required to daemonize a process, except for the second fork. Before doing that, I redirect output to a temporary log file, then monitor the log file using inotify (on Linux) or kqueues (on OS X).
As far as I know, the only loss to my stability requirement (3.) of omitting the second fork is if the child process obtains a controlling tty.
Is this solution a good one for these requirements? What bad things could happen to the child processes that I haven't considered?

Resources