How does msync() work? - unix

I use mmap to map file F to block B, and then I write only one byte of B.
If I call msync() on B with MS_SYNC, does the OS write the whole block to F, or only the one modified byte?

This is OS- and architecture-specific, but most likely only the dirty page will be written to disk.

What does the man page on your particular system say? If it's not open source, that's about the best you have to go on, unless you can find more detailed documentation for your UNIX platform.
On at least one system, man msync says:
The msync() system call writes modified whole pages back to the
filesystem and updates the file modification time. Only those pages
containing addr and len-1 succeeding locations will be examined.
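To make the page granularity concrete, here is a minimal sketch that changes one byte through a shared mapping and then asks msync() to flush only the page containing it. The helper name write_byte_via_mmap and its parameters are hypothetical, and the sketch assumes the file already holds at least file_len bytes. Whether the kernel writes exactly one page is still up to the OS.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper (not from the original post): set one byte of a file
 * through a shared mapping and flush only the page that contains it.
 * Assumes the file already holds at least file_len bytes.
 * Returns 0 on success, -1 on error. */
int write_byte_via_mmap(const char *path, size_t file_len, size_t offset,
                        unsigned char value)
{
    if (offset >= file_len)
        return -1;

    int fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;

    unsigned char *map = mmap(NULL, file_len, PROT_READ | PROT_WRITE,
                              MAP_SHARED, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    if (map == MAP_FAILED)
        return -1;

    map[offset] = value; /* dirties exactly one page of the mapping */

    /* msync() needs a page-aligned address, so round down to the page start. */
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t page_start = offset - (offset % page);
    size_t flush_len = file_len - page_start;
    if (flush_len > page)
        flush_len = page;

    int rc = msync(map + page_start, flush_len, MS_SYNC);
    if (munmap(map, file_len) != 0)
        rc = -1;
    return rc;
}
```

A call such as write_byte_via_mmap("data.bin", 8192, 5000, 0x5A) would sync the second page on a system with 4 KiB pages.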

Related

Can the __LINKEDIT segment of a Mach-O executable be moved

In a Mach-O executable, I am trying to increase the size of the __LLVM segment that precedes the __LINKEDIT segment (with a home-grown tool). I am considering two strategies: (a) move the __LLVM segment to after the __LINKEDIT segment, producing a file that is not what ld would create (now with a gap and section addresses out of order), and (b) move the __LINKEDIT segment to allow resizing of the __LLVM segment that precedes it. I need the result to be accepted for downstream processing, e.g. generating an .ipa file or sending to the App Store.
This question is about my assumptions and the viability of these approaches. Specifically, what are the potential pitfalls of each that might lead them to fail?
I implemented the first approach (a). The result is understood by segedit's -extract option, but its -replace option complains that the segments are out of order. I append a new segment to the file and update the address and length values in the corresponding load command to refer to this new segment data (both in the file and the destination memory). This might be fine, as long as the other downstream processing will accept the result (still to check; e.g. any local signature is likely invalidated).
The second approach (b) would seem cleaner, as long as there are no references into the __LINKEDIT segment, which I guess contains linking information (symbol tables etc., rather than code). I have not tried this yet, though it seems to be a foregone conclusion that segedit will be happy with the result, which may suggest other processing might also be happier. Are there likely to be any references that are invalidated due to simply moving this segment? I am guessing that I will have to update further load commands (they seem to reference into the __LINKEDIT segment), which I have not examined, but this should be fairly straightforward.
EDIT: Replaced my confused use of "section" with "segment" (mentioned in answer).
ADDED: The context is that I have no control over generating the original executable. I need to post-process it, essentially performing a 'segedit -replace' process, wherein a section in the segment is to be replaced with a section that is larger than the space previously allocated for the segment.
RUN-ON clarifying question: It seems from the answer that moving the __LINKEDIT segment will break it. Can this be fixed by adjusting load commands only (e.g. LC_DYLD_INFO_ONLY, LC_LOAD_DYLINKER, LC_LOAD_DYLIB), not data in any segments? I am not yet familiar with these load commands, and would like to know whether to pursue this.
So basically the segments and sections describe how the physical file maps onto virtual memory.
As I mentioned in the previous iteration of this answer, there are constraints on segment order:
The __TEXT segment must start at executable physical file offset 0
The __LINKEDIT segment must not start at physical file offset 0
__LINKEDIT's File Offset + File Size should be equal to the physical executable size (this implies __LINKEDIT being the last segment). Otherwise code signing won't work.
LC_DYLD_INFO_ONLY contains file offsets to the data dyld uses at load time for:
rebase
bind at load
weak bind
lazy bind
export
For each kind there is a file offset and size entry in LC_DYLD_INFO_ONLY describing the data in the file, which lies within __LINKEDIT (in a "regular" ld-linked executable). LC_DYLD_INFO_ONLY does not use any segment or section information from __LINKEDIT directly; the file offsets and sizes are enough.
EDIT: also, as mentioned in @kirelagin's answer:
"Apparently, the new version of dyld from 10.12 Sierra performs a check that previous versions did not perform: it makes sure that the LC_SYMTAB symbols table is entirely within the __LINKEDIT segment."
I assume that since you want to inflate the size of the preceding __LLVM segment, you would also want some extra data in the file itself. Typically the data described by __LINKEDIT (i.e. not the segment and sections themselves, but the actual data) won't use 100% of its space, so it could be modified to start "later" and occupy less space.
A tool called jtool by Jonathan Levin could probably do it for you.
I know this is an old question, but I solved this problem while solving another problem.
Define the slide amount. It must be page-aligned, so I chose 0x4000.
Add the slide amount to the relevant load commands. This includes, but is not limited to:
__LINKEDIT segment (duh)
dyld_info_command
symtab_command
dysymtab_command
linkedit_data_commands
Physically move the __LINKEDIT data in the file.
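The offset-adjustment step above can be sketched as follows for one load command. This is only an illustration under stated assumptions: struct dyld_info_offsets is a hypothetical local struct holding just the five offset fields named after those in dyld_info_command (real code would use the definitions in <mach-o/loader.h>), and treating a zero offset as "table absent" is an assumption, not something the answer states. The same pattern would apply to the other commands listed.

```c
#include <stdint.h>

/* Hypothetical local struct (assumption): only the file-offset fields,
 * named after those in dyld_info_command. Not the real on-disk layout. */
struct dyld_info_offsets {
    uint32_t rebase_off;
    uint32_t bind_off;
    uint32_t weak_bind_off;
    uint32_t lazy_bind_off;
    uint32_t export_off;
};

/* The slide must be a multiple of the page size (0x4000 in the answer). */
int slide_is_page_aligned(uint32_t slide, uint32_t page_size)
{
    return page_size != 0 && slide % page_size == 0;
}

/* Add the slide to one file offset. Assumption: an offset of 0 means the
 * table is absent, so it is left untouched. */
uint32_t slide_offset(uint32_t off, uint32_t slide)
{
    return off != 0 ? off + slide : 0;
}

/* Apply the slide to every offset that points into __LINKEDIT data. */
void slide_dyld_info(struct dyld_info_offsets *d, uint32_t slide)
{
    d->rebase_off    = slide_offset(d->rebase_off, slide);
    d->bind_off      = slide_offset(d->bind_off, slide);
    d->weak_bind_off = slide_offset(d->weak_bind_off, slide);
    d->lazy_bind_off = slide_offset(d->lazy_bind_off, slide);
    d->export_off    = slide_offset(d->export_off, slide);
}
```

After patching the load commands this way, the __LINKEDIT bytes themselves still have to be moved by the same amount in the file.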

System V Unix and swap area

I ran into some trouble understanding things when I started to learn the virtual memory mechanism in Unix System V. There is a structure, proc, that contains comprehensive information about any process in the system (I believe proc has since been replaced by task_struct, but the concept is still relevant today). struct proc contains a pointer to struct as, which in turn points to a list of struct seg.
Sorry for my rough sketch! It looks roughly like the picture. My question concerns virtual memory. There is a segvn_data structure that is linked to a vnode. The vnode is part of the file system and handles all work with the file, including changes on the hard drive. Is it right that a segment can handle pages that are mapped by a vnode? Why should they both use the swap area? What is the swap area responsible for?
PS: When does a segment need to change the swap area?
PPS: Do all processes need to have a swap area? When does a vnode require that a segment be mapped in the swap area?

Write system call and blocking the process

In UNIX, the read system call blocks the process until it is done.
How does the write system call behave? Does it block the process while it is writing to the disk?
By write system call I mean the write(fd, bf, nbyte) procedure call.
No, it only blocks the process until the content of the buffer is copied to kernel space. This usually takes a very short time, but there are some cases where it may wait for disk operations:
If there are no free pages, some must be freed. If there are clean pages, their content can be discarded (as it is just a copy of what is on disk), but if there are not, some pages must be laundered, which involves a write. Since pages are laundered automatically after a few seconds, this almost never happens if you have enough memory.
If the write is to the middle of the file, the surrounding content may need to be read, because the page cache has page granularity (aligned 4 KiB blocks on most platforms). This happens rarely, because it is rare to update a file without reading it, and if you read it first, the content is already cached.
If you want to wait until the data actually hits the platters, you need to follow up with fsync(2).

Why does System V shared memory have separate get and attach functions?

Using System V shared memory IPC requires calls to the following two functions:
int shmget(key_t key, size_t size, int shmflg);
void *shmat(int shmid, const void *shmaddr, int shmflg);
Why are they designed to be separate, instead of having a single function that accepts these arguments, performs both functions and simply returns the address?
We can consider files as an analogy. Calling open on a string (the file path) gives us a file descriptor, and we use that to read from and write to the file. We call close on the file descriptor when we're done. This design seems natural: we don't have to open with a string to get a descriptor and then attach to the descriptor.
As an example of what I have in mind, take a look at the FreeBSD sendmail shared memory implementation.
This kind of separation (shm_open and mmap) also exists with POSIX shared memory, but the reason was that mmap existed before shm_open was implemented and could be reused, and mmap requires a descriptor (source: UNIX Network Programming Vol. 2, R. Stevens, chapter 13, page 326).
Shared memory is probably one of the fastest forms of IPC, since data need not be copied. The problem associated with it, though, is synchronizing access between multiple threads. You could do this using semaphores or record locks. We end up using the latter in Unix for shared memory: even though they are not as efficient, they are simple, the system cleans up well, and you don't need some of the bling that semaphores bring along.
Let's look into how these work to understand why they are implemented as such.
Enter the shmid_ds structure used by the Linux kernel (http://www.tldp.org/LDP/lpg/node68.html).
Its shm_nattch field is the counter of current attaches. shmget gets you a shm id and sets up things like the ipc_perm, the timestamps (atime, ctime), the pid and the requested segment size (shm_segsz).
Next, shmctl does the control work using IPC_STAT, IPC_SET and IPC_RMID: getting the status of a segment, setting permissions, removing the shm id for a segment, or even locking or unlocking it.
Once the segment is ready, a process uses shmat to attach it to its address space, depending on the flags and address parameters. Once it attaches, the kernel increments shm_nattch. To detach, we call shmdt. Removal of the identifier and its associated data structure is not automatic: some process has to do it by calling shmctl with IPC_RMID, subject to the permissions in shm_perm.
As you can see this is all very similar to how one would use semaphores and the implementation makes sense.
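The lifecycle described above can be sketched like this. It is a minimal illustration, assuming a Linux-like system where System V IPC is available. The function names shm_attach_count and shm_lifecycle are hypothetical, and the segment is created with IPC_PRIVATE so no key has to be agreed on.

```c
#include <stddef.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Read shm_nattch via IPC_STAT. Returns the attach count, or -1 on error. */
long shm_attach_count(int shmid)
{
    struct shmid_ds ds;
    if (shmctl(shmid, IPC_STAT, &ds) != 0)
        return -1;
    return (long)ds.shm_nattch;
}

/* Hypothetical demo: walk one private segment through get, attach, detach
 * and remove. counts[] receives shm_nattch before attaching, while
 * attached, and after detaching. Returns 0 on success, -1 on error. */
int shm_lifecycle(size_t size, long counts[3])
{
    /* Step 1: shmget only reserves the segment and returns its id. */
    int shmid = shmget(IPC_PRIVATE, size, IPC_CREAT | 0600);
    if (shmid < 0)
        return -1;

    int rc = -1;
    counts[0] = shm_attach_count(shmid); /* exists, nobody attached */

    /* Step 2: shmat maps it into this process and bumps shm_nattch. */
    char *addr = shmat(shmid, NULL, 0);
    if (addr != (char *)-1) {
        addr[0] = 'x'; /* the memory is usable like any other */
        counts[1] = shm_attach_count(shmid);

        /* Step 3: shmdt unmaps it and drops shm_nattch again. */
        if (shmdt(addr) == 0) {
            counts[2] = shm_attach_count(shmid);
            rc = 0;
        }
    }

    /* Step 4: removal is explicit; detaching alone does not free it. */
    if (shmctl(shmid, IPC_RMID, NULL) != 0)
        rc = -1;
    return rc;
}
```

On a typical system the three recorded counts come out as 0, 1 and 0, which is the reference counting the text describes.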
One possible reason I could think of is this:
(From the manpage of shmget)
After a fork(2) the child inherits the attached shared memory segments.
After an execve(2) all attached shared memory segments are detached from the process.
Upon _exit(2) all attached shared memory segments are detached from the process.
Well, technically attaching and detaching is basic reference counting on the shared memory segment that is reserved during shmget.
The two functions, allocating the shared memory segment via shmget and reference counting it (up or down, via shmat and shmdt respectively), are separate so that code can be reused during fork and exec.
If they were both packed into the same function, you would still need a separate function that does only the reference counting (to be invoked during fork/exec). So I think this design is simply meant to promote code reuse and avoid code duplication.

Techniques for infinitely long pipes

There are two really simple ways to let one program send a stream of data to another:
Unix pipe, or TCP socket, or something like that. This requires constant attention by the consumer program, or the producer program will block. Even after increasing buffers beyond their typically tiny defaults, it's still a huge problem.
Plain files - producer program appends with O_APPEND, consumer just reads whatever new data became available at its convenience. This doesn't require any synchronization (as long as diskspace is available), but Unix files only support truncating at the end, not at beginning, so it will fill up disk until both programs quit.
Is there a simple way to have it both ways, with data stored on disk until it gets read, and then freed? Obviously programs could communicate via database server or something like that, and not have this problem, but I'm looking for something that integrates well with normal Unix piping.
A relatively simple hand-rolled solution.
You could have the producer create files and keep writing until each one reaches a certain size or number of records, whatever suits your application. The producer then closes the file and starts a new one, using an agreed naming algorithm.
The consumer reads new records from a file and, when it gets to the agreed maximum size, closes and unlinks it and then opens the next one.
If your data can be split into blocks or transactions of some sort, you can use the file method for this with a serial number. The data producer would store the first megabyte of data in outfile.1, the next in outfile.2 etc. The consumer can read the files in order and delete them when read. Thus you get something like your second method, with cleanup along the way.
You should probably wrap all this in a library, so that from the application's point of view this is a pipe of some sort.
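A rough sketch of such a library, under stated assumptions: segments are named <prefix>.<serial>, every segment is filled to exactly SEG_SIZE bytes before the next one is started, and there is one producer and one consumer. All names here (SEG_SIZE, struct seg_end, seg_write, seg_read) are hypothetical.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

#define SEG_SIZE 4096 /* agreed segment size (assumption) */

struct seg_end {          /* one end of the queue: producer or consumer */
    const char *prefix;   /* agreed naming: <prefix>.<serial> */
    unsigned serial;      /* current segment number */
    size_t pos;           /* bytes written or read in that segment */
    int fd;               /* -1 while no segment is open */
};

static void seg_name(const struct seg_end *e, char *out, size_t n)
{
    snprintf(out, n, "%s.%u", e->prefix, e->serial);
}

/* Move on to the next segment; the consumer also unlinks the old one. */
static void seg_advance(struct seg_end *e, int unlink_old)
{
    char name[256];
    seg_name(e, name, sizeof name);
    close(e->fd);
    if (unlink_old)
        unlink(name);
    e->fd = -1;
    e->serial++;
    e->pos = 0;
}

/* Producer: append len bytes, rolling over whenever a segment is full. */
int seg_write(struct seg_end *e, const char *buf, size_t len)
{
    char name[256];
    while (len > 0) {
        if (e->fd < 0) {
            seg_name(e, name, sizeof name);
            e->fd = open(name, O_WRONLY | O_CREAT | O_APPEND, 0600);
            if (e->fd < 0)
                return -1;
        }
        size_t room = SEG_SIZE - e->pos;
        ssize_t n = write(e->fd, buf, len < room ? len : room);
        if (n < 0)
            return -1;
        buf += n;
        len -= (size_t)n;
        e->pos += (size_t)n;
        if (e->pos == SEG_SIZE)
            seg_advance(e, 0);
    }
    return 0;
}

/* Consumer: read whatever is available, up to len bytes. Returns the
 * number of bytes read (0 if nothing new yet), or -1 on error. */
ssize_t seg_read(struct seg_end *e, char *buf, size_t len)
{
    char name[256];
    size_t total = 0;
    while (total < len) {
        if (e->fd < 0) {
            seg_name(e, name, sizeof name);
            e->fd = open(name, O_RDONLY);
            if (e->fd < 0) {
                if (errno == ENOENT)
                    break; /* producer has not created it yet */
                return -1;
            }
        }
        size_t room = SEG_SIZE - e->pos;
        size_t want = len - total < room ? len - total : room;
        ssize_t n = read(e->fd, buf + total, want);
        if (n < 0)
            return -1;
        if (n == 0)
            break; /* caught up with the producer */
        total += (size_t)n;
        e->pos += (size_t)n;
        if (e->pos == SEG_SIZE)
            seg_advance(e, 1); /* fully consumed: free the disk space */
    }
    return (ssize_t)total;
}
```

Both ends start from a struct such as { "/tmp/queue", 0, 0, -1 }. Disk usage is then bounded by the data not yet read plus at most one partially consumed segment.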
You should read some documentation on socat. You can use it to bridge the gap between tcp sockets, fifo files, pipes, stdio and others.
If you're feeling lazy, there are some nice examples of useful commands.
I'm not aware of anything, but it shouldn't be too hard to write a small utility that takes a directory as an argument (or uses $TMPDIR) and uses select/poll to multiplex between reading from stdin, paging to a series of temporary files, and writing to stdout.
