virtual address - unix

Suppose I'm starting two instances of the same program. Will the text region of both programs have same virtual addresses?

Depends. On most systems, if you run the same program twice in the same environment (same parameters, etc.), you'll find the same address mapping. This is simply because most of what the process does is deterministic: it depends only on the environment, command-line parameters and contents of files read, not on changing data such as the date or process ID. This is very useful when debugging: if you restart your program, sometimes even after a small code change and recompilation, there is a good chance that the memory layout remained the same. Of course, different instances of the program running concurrently may have the same virtual addresses, but their private data won't have the same physical addresses (read-only code pages may well be shared between them).
Some systems, such as OpenBSD, or Linux with various hardening settings, implement address space layout randomization (ASLR). ASLR means that each time a process starts, the virtual addresses of its code, data, stack(s) and heap(s) are determined at random. This is a security feature, designed to make exploits of security vulnerabilities harder: the exploit code can't just access known code at known addresses. However, as ASLR becomes more popular, exploits also become more sophisticated to work around it. ASLR remains useful because it increases the workload for the exploit writer without adding a lot of complexity.
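One way to see which behaviour your own system has is to start the same program twice and compare a code address from each run. The sketch below is only an illustration: it assumes a Unix-like system where `ctypes.CDLL(None)` exposes libc symbols, and it samples the address of libc's `printf` (shared-library code, not the main executable's text), so it shows whether library mappings move between runs. The name `sample_code_addresses` is made up for this example.

```python
import subprocess
import sys

# Child snippet: print the address at which libc's printf is mapped.
CHILD = (
    "import ctypes;"
    "print(ctypes.cast(ctypes.CDLL(None).printf, ctypes.c_void_p).value)"
)

def sample_code_addresses(runs=2):
    """Start `runs` fresh interpreters and collect one code address from each."""
    addresses = []
    for _ in range(runs):
        out = subprocess.run([sys.executable, "-c", CHILD],
                             capture_output=True, text=True, check=True)
        addresses.append(int(out.stdout.strip()))
    return addresses

if __name__ == "__main__":
    first, second = sample_code_addresses()
    # With ASLR enabled the two values normally differ; without it they match.
    print(hex(first), hex(second), "same" if first == second else "different")
```

Whether the two addresses match depends entirely on the system's ASLR settings, so no particular output should be expected.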

Probably not, but it's possible that they could. Each process has its own independent memory space.

Related

Distributed eventual consistency Key Value Store

I find it difficult to convince myself of the advantage of using a complex design like DynamoDB over a simple duplication strategy.
Let's say we want to build a distributed key/value data store over 5 servers (each server holds exactly the same replica).
An eventually consistent system, like DynamoDB, typically uses complicated conflict reconciliation, vector timestamps, etc. to achieve eventual consistency.
But instead, why couldn't we simply do the following:
For write, client will issue the write command to all the servers. So all servers will execute the clients' write command in the same order. It will reply to clients before servers commit the write.
For read, client will just do a round robin, only one server at a time will take care of read command. (Other servers won't see the read command)
Yes, the client may temporarily see stale data, but eventually all replicas will have the same dataset, which is the same semantics as DynamoDB.
What's the disadvantage of this simple design vs Complicated DynamoDB?
Your strategy has a few disadvantages, but their exact nature depends on details you haven't covered.
One obvious example is dealing with network segmentation, that is, when one part of your network becomes segmented (disconnected) from another part.
In this case, you have a couple of choices about how to react when you try to write some data to the server, and that fails. You might just assume that it worked, and continue as if everything was fine. If you do that, and the server later comes back on line, a read may return stale data.
To prevent that, you might treat a failed write as a true failure, and refuse to accept the write until/unless all servers confirm the write. This, unfortunately, makes the system as a whole quite fragile--in fact, much more fragile (at least with respect to writing) than if you didn't replicate at all (because if any of the servers go off-line, you can't write any more). It also has one more problem: it limits write throughput to the (current) speed of the slowest server, so even if they're all working, unless they're perfectly balanced (unlikely to happen) you're wasting capacity.
To prevent those problems, many systems (including Paxos, if memory serves) use some sort of "voting" based system. That is, you attempt to write to all the servers. You consider the write complete if and only if the majority of servers confirm that they've received the write. Likewise on a read, you attempt to read from all the servers, and you consider a value properly read if and only if the majority of servers agree on the value.
This way, any minority of the servers (fewer than half, so 2 out of 5) can be off-line at any given time, and you can still read and write data. Likewise, if you have a few servers that react a little more slowly than the rest, that doesn't slow down operations overall.
Of course, you need to fill in quite a few details to create a working system--but the fact remains that the basic concept is pretty simple, as outlined above.
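As a rough illustration of the voting idea, here is a minimal in-memory sketch. It is not how DynamoDB or Paxos is implemented; the `Replica` and `QuorumStore` names are invented for the example, "off-line" is simulated with a flag, and values are assumed to be hashable so they can be counted.

```python
from collections import Counter

class Replica:
    """One server: a dict plus a flag that simulates being off-line."""
    def __init__(self):
        self.data = {}
        self.online = True

    def write(self, key, value):
        if not self.online:
            return False          # no acknowledgement
        self.data[key] = value
        return True

    def read(self, key):
        if not self.online:
            return (False, None)  # no answer
        return (True, self.data.get(key))

class QuorumStore:
    def __init__(self, n=5):
        self.replicas = [Replica() for _ in range(n)]
        self.majority = n // 2 + 1

    def write(self, key, value):
        # Try every replica; the write counts only if a majority acknowledged.
        # (A real system would also have to undo or repair a failed write.)
        acks = sum(r.write(key, value) for r in self.replicas)
        return acks >= self.majority

    def read(self, key):
        # Ask every replica; accept a value only if a majority agree on it.
        answers = [v for ok, v in (r.read(key) for r in self.replicas) if ok]
        if not answers:
            raise RuntimeError("no replica reachable")
        value, votes = Counter(answers).most_common(1)[0]
        if votes < self.majority:
            raise RuntimeError("no majority agrees on a value")
        return value
```

With 5 replicas, a write made while two of them are off-line still succeeds, and a later read still returns the new value even if one of those two comes back holding stale data, because three replicas outvote it.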

Why does process creation using `clone` result in an out-of-memory failure?

I have a process that allocates about 20GB of RAM on a 32GB machine. After some events, I'm streaming the data from the parent process to stdin of the child process. It's mandatory to keep the 20GB of data in the parent process at the point when the child is spawned.
The app is written in Rust and I'm calling Command::new("path/to/command") to create the child process.
When I spawn the child process the operating system is trapping an out-of-memory error.
strace output:
[pid 747] 16:04:41.128377 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7ff4c7f87b10) = -1 ENOMEM (Cannot allocate memory)
Why does the trap occur? The child process should not consume more than 1GB and exec() is called immediately after clone().
The Problem
When a child process is created by the Rust call, several things happen at a C/C++ level. This is a simplification, but it will help explain the dilemma.
The streams are duplicated (with dup2 or a similar call)
The parent process is forked (with the fork or clone system call)
The forked process executes the child (with a call from the exec family, such as execvp)
The parent and child are now concurrent processes. The Rust call you are currently using appears to be a clone call that is behaving much like a pure fork, so you're 20G x 2 - 32G = 8G short, without considering the space needed by the operating system and anything else that might be running. The clone call fails: it returns -1 and errno is set to ENOMEM.
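The three steps above (dup2, fork, exec) can be sketched in Python, whose `os` module wraps the same system calls. This is an illustration of the sequence, not the code Rust's standard library actually runs; `run_with_stdin` is a made-up helper, and it assumes a payload small enough to fit in the pipe buffer.

```python
import os

def run_with_stdin(argv, payload):
    """Fork, wire pipes to the child's stdin/stdout, exec argv.

    Returns (exit_status, bytes_written_by_child).
    """
    in_r, in_w = os.pipe()      # parent writes -> child's stdin
    out_r, out_w = os.pipe()    # child's stdout -> parent reads
    pid = os.fork()             # the step that failed with ENOMEM in the question
    if pid == 0:
        # Child: duplicate the pipe ends onto fds 0 and 1, then exec.
        try:
            os.dup2(in_r, 0)
            os.dup2(out_w, 1)
            for fd in (in_r, in_w, out_r, out_w):
                os.close(fd)
            os.execvp(argv[0], argv)   # does not return on success
        finally:
            os._exit(127)              # reached only if exec failed
    # Parent: close the ends it does not use, stream the data, collect output.
    os.close(in_r)
    os.close(out_w)
    os.write(in_w, payload)            # assumes a small payload
    os.close(in_w)
    chunks = []
    while True:
        chunk = os.read(out_r, 4096)
        if not chunk:
            break
        chunks.append(chunk)
    os.close(out_r)
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status), b"".join(chunks)
```

For example, `run_with_stdin(["cat"], b"hello\n")` pipes the bytes through `cat`. Note that the fork happens before the exec, which is why the kernel has to account for a copy of the parent's address space first.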
If the architectural solutions of adding physical memory, compressing the data, or streaming it through a process that does not require the entirety of it to be in memory at any one time are not options, then the classic solution is reasonably simple.
Recommendation
Design the parent process to be lean. Then spawn two worker children, one that handles your 20GB need and the other that handles your 1GB need [1]. These children can be connected to one another via pipe, file, shared memory, socket, semaphore, signalling, and/or other communication mechanism(s), just as a parent and child can be.
Many mature software packages from Apache httpd to embedded cell tower routing daemons use this design pattern. It is reliable, maintainable, extensible, and portable.
The 32G would then likely suffice for the 20G and 1G processing needs, along with OS and lean parent process.
Although this solution will surely solve your problem, if the code is to be reused or extended later, there may be value in looking into potential process design changes involving data frames or multidimensional slices to support streaming of data and memory requirement reductions.
Memory Overcommit Always
Setting overcommit_memory to 1 eliminates the clone error condition referenced in the question because the Rust call calls the Linux clone call, which honors that setting. But there are several caveats with this solution that point back to the above recommendation as superior, primarily that the value of 1 is dangerous, especially for production environments.
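To check which mode a machine is currently in, you can read the setting from procfs. The following is a small sketch assuming a Linux system; `describe_overcommit` is a hypothetical helper, and the `path` parameter exists only so the function can be pointed at another file for testing.

```python
from pathlib import Path

# Meaning of the three documented values of vm.overcommit_memory.
MODES = {
    0: "heuristic overcommit (default)",
    1: "always overcommit",
    2: "strict accounting, no overcommit beyond the commit limit",
}

def describe_overcommit(path="/proc/sys/vm/overcommit_memory"):
    """Return (mode, description) for the current overcommit policy."""
    mode = int(Path(path).read_text().strip())
    return mode, MODES.get(mode, "unknown mode")
```

Changing the value requires root (for example through sysctl) and affects every process on the machine, which is part of why it is a heavier hammer than restructuring the processes.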
Background
Kernel discussions about OpenBSD rfork and the clone call ensued in the late 1990s and early 2000s. The features stemming from those discussions permit less extreme forking than processes, which is symmetrically like the provision of more extensive independence between pthreads. Some of these discussions have produced extensions to the traditional process spawning that have entered POSIX standardization.
In the early 2000s, Linus Torvalds suggested a flag structure to determine what components of the execution model are shared and what are copied when execution forks, blurring the distinction between processes and threads. From this, the clone call emerged.
Over-committing memory is not discussed much, if at all, in those threads. The design goal was MORE control of the results of a fork rather than the delegation of memory usage optimization to an operating system heuristic, which is what the default setting of overcommit_memory = 0 does.
Caveats
Memory overcommit goes beyond these extensions, adding the complexity of trade-offs of its modes [2], design trend caveats [3], practical run time limitations [4], and performance impacts [5].
Portability and Longevity
Additionally, without standardization, the code using memory overcommit may not be portable, and the question of longevity is pertinent, especially when a setting controls the behavior of a function. There is no guarantee of backward compatibility or even some warning of deprecation if the setting system changes.
Danger
The linuxdevcenter documentation [2] says, "1 always overcommits. Perhaps you now realize the danger of this mode.", and there are other indications of danger with ALWAYS overcommitting [6], [7].
The implementers of overcommit on Linux, Windows, and VMware may guarantee reliability, but it is a statistical game that, combined with the many other complexities of process control, may lead to certain unstable characteristics under certain conditions. Even the name overcommit tells us something about its true character as a practice.
A non-default overcommit_memory mode, for which several warnings are issued, may work for the immediate trial of the immediate case but later lead to intermittent reliability problems.
Predictability and Its Impact on System Reliability and Response Time Consistency
The idea of a process in a UNIX-like operating system, from its Bell Labs beginnings, is that a process makes concrete requests to its container, the operating system. The result is both predictable and binary: either the request is denied or it is granted. Once granted, the process is given complete control of and direct access to the resources until it relinquishes them.
The swap space aspect of virtual memory is a breach of this principle that appears as gross deceleration of activity on workstations, when RAM is heavily consumed. For instance, there are times during development when one presses a key and has to wait ten seconds to see the character on the display.
Conclusion
There are many ways to get the most out of physical memory, but doing so by hoping that use of the allocated memory will be sparse will likely introduce negative impacts. Performance hits from swapping when overcommit is overused are the well-documented example. If you are keeping 20G of data in RAM, this may particularly be the case.
Only allocating what is needed, forking in intelligent ways, using threads, and freeing memory that is surely no longer needed lead to memory thrift without impacting reliability or creating spikes in swap disk usage, and such a design can operate without caveat up to the limits of system resources.
The position of the designer of the Command::new call may be based on this perspective. In this case, how soon after the fork the exec is called is not a determining factor in how much memory is requested during the spawn.
Notes and References
[1] Spawning worker children may require some code refactoring and appear to be too much trouble on a superficial level, but the refactoring may be surprisingly straightforward and significantly beneficial.
[2] http://www.linuxdevcenter.com/pub/a/linux/2006/11/30/linux-out-of-memory.html?page=2
[3] https://www.etalabs.net/overcommit.html
[4] http://www.gabesvirtualworld.com/memory-overcommit-in-production-yes-yes-yes/
[5] https://labs.vmware.com/vmtj/memory-overcommitment-in-the-esx-server
[6] https://github.com/kubernetes/kubernetes/issues/14452
[7] http://linuxtoolkit.blogspot.com/2011_08_01_archive.html

Confusion as to how fork() and exec() work

Consider the following:
Where I'm getting confused is in the step "child duplicate of parent". If you're running a process such as say, skype, if it forks, is it copying skype, then overwriting that process copy with some other program? Moreover, what if the child process has memory requirements far different from the parent process? Wouldn't assigning the same address space as the parent be a problem?
I feel like I'm thinking about this all wrong, perhaps because I'm imagining the processes to be entire programs in execution rather than some simple instruction like "copy data from X to Y".
All modern Unix implementations use virtual memory. That allows them to get away with not actually copying much when forking. Instead, the child's memory map points at the parent's pages until one of the two processes starts modifying them; only then is the affected page copied.
When a child process exec's a program, that program is copied into memory (if it wasn't already there) and the process's memory map is updated to point to the new program.
fork(2) is difficult to understand. It is explained a lot; read also the fork (system call) wikipage and several chapters of Advanced Linux Programming. Notice that fork does not copy the running program (i.e. the /usr/bin/skype ELF executable file), but it lazily copies (using copy-on-write techniques, by configuring the MMU) the address space (in virtual memory) of the forking process. Each process has its own address space (but might share some segments with some other processes, see mmap(2) and execve(2) ....). Since each process has its own address space, changes in the address space of one process do not (usually) affect any other process, the parent included. However, processes may have shared memory but then need to synchronize: see shm_overview(7) & sem_overview(7)...
By definition of fork, just after the fork syscall the parent and child processes have nearly equal state (in particular the address space of the child is a copy of the address space of the parent). The only difference being the return value of fork.
And execve is overwriting the address space and registers of the current process.
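A small experiment makes the "copy of the address space" point concrete. The sketch below uses Python's `os.fork` wrapper on a Unix-like system; `fork_isolation_demo` is just an illustrative name. The child changes a variable and reports what it sees through a pipe, while the parent's copy stays untouched.

```python
import os

def fork_isolation_demo():
    """Return (value seen by parent, value seen by child) after the child writes."""
    value = [1]
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: this write lands in the child's own copy of the page.
        value[0] = 99
        os.write(w, str(value[0]).encode())
        os._exit(0)
    # Parent: read what the child saw, then look at our own copy.
    os.close(w)
    child_seen = int(os.read(r, 16).decode())
    os.close(r)
    os.waitpid(pid, 0)
    return value[0], child_seen

if __name__ == "__main__":
    print(fork_isolation_demo())  # prints (1, 99)
```

Both processes use the same virtual address for `value`, yet they observe different contents, because the kernel copied the page when the child wrote to it.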
Notice that on Linux all processes (with a few exceptions, like kernel-started processes such as /sbin/modprobe etc.) are obtained by forking from the initial /sbin/init process of pid 1.
At last, system calls -listed in syscalls(2)- like fork are an elementary operation from the application's point of view, since the real processing is done inside the Linux kernel. Play with strace(1). See also this answer and that one.
A process is often some machine state (registers) + its address space + some kernel state (e.g. file descriptors), etc... (but read about zombie processes).
Take time to follow all the links I gave you.

MPI on a multicore machine

My situation is quite simple: I want to run an MPI-enabled program on a single multiprocessor/multicore machine with, let's say, 8 cores.
My implementation of MPI is MPICH2.
As I understand I have a few options:
$ mpiexec -n 8 my_software
$ mpiexec -n 8 -hosts {localhost:8} my_software
or I could also specify Hydra to "fork" and not "ssh";
$ mpiexec -n 8 -launcher fork my_software
Could you tell me if there will be any differences or if the behavior will be the same?
Of course as all my nodes will be on the same machine I don't want "message passing" to be done through the network (even the local loop) but through shared memory. As I understood MPI will figure that out itself and that will be the case for all the three options.
Simple answer:
All methods should lead to the same performance. You'll have 8 processes running on the cores and using shared memory.
Technical answer:
"fork" has the advantage of compatibility, on systems where rsh/ssh process spawning would be a problem. But can, I guess, only start processes locally.
In the end (unless MPI is weirdly configured) all processes on the same machine will end up using "shared memory", and the launcher or the host specification method should not matter for this. The communication method is handled by another parameter (-channel ?).
The specific syntax of the host specification method can let you bind processes to a specific CPU core; you might then see slightly better or worse performance depending on your application.
If you've got everything set up correctly then I don't see that your program's behaviour will depend on how you launch it, unless that is it fails to launch under one or other of the options. (Which would mean that you didn't have everything set up correctly in the first place.)
If memory serves me well the way in which message passing is implemented depends on the MPI device(s) you use. It used to be that you would use the mpi ch_shmem device. This managed the passing of messages between processes but it did use buffer space and messages were sent to and from this space. So message passing was done, but at memory bus speed.
I write in the past tense because it's a while since I was that close to the hardware that I knew (or, frankly, cared) about low-level implementation details and more modern MPI installations might be a bit, or a lot, more sophisticated. I'll be surprised, and pleased, to learn that any modern MPI installation does, in fact, replace message-passing with shared memory read/write on a multicore/multiprocessor machine. I'll be surprised because it would require translating message-passing into shared memory access and I'm not sure that that is easy (or easy enough to be feasible) for the whole of MPI. I think it's far more likely that current implementations still rely on message-passing across the memory bus through some buffer area. But, as I state, that's only my best guess and I'm often wrong on these matters.

Address Inquiry

Okay so, let's say I have an integer.
When I execute the program, that integer gets an address.
Makes sense.
But there are many programs out there. Say I'm creating a game hack, for Minesweeper for example: I find the address where some data is stored and change it.
But that hack, that simple hack which just changes the value at some address, works on every computer, every time.
So that data is getting the same address every time.
And on my computer, there are about 30 exes running right now.
Don't other programs want that address? What if they do? Why does the hack work every time? Why don't other programs want that very same address?
Every application gets its own virtual address space (4GB on 32-bit machines) to overcome that problem in a multitasking operating system.
Here is a pretty good article covering the subject.
Your "hack" is probably locating a process using something like OpenProcess and editing the memory using WriteProcessMemory. That's why it works on "all" machines.
Basically, you need to read about virtual memory. The purpose of virtual memory is to abstract away the physical address space, and give each process (i.e. each application) its own "virtual" address space, which avoids the problem that you're describing.
If your minesweeper hack consists of manipulating data stored at a fixed static address, there's no way it would work on every computer. Program memory allocation is OS dependent.
