How does an open() system call return a file descriptor? - unix

I want to know what happens in the kernel when an open() system call is invoked. How does it return a file descriptor for a file?

The kernel internally creates a structure containing additional information about the file you just opened. This structure holds information such as the inode number, the name of the file on the file system, its size, its associated superblock, etc.
In fact, within the kernel, it is the VFS (Virtual File System) that handles I/O operations on a file, whether it is local (on your hard disk) or remote (located on an FTP server, for instance, as ftpfs does).
Every file system on GNU/Linux implements the same mechanisms for opening, reading, writing and closing files. This means developers don't have to care what kind of file they are accessing: whatever the file type, the same open(), read() ... APIs can be used. You can find additional information on what the VFS is here and here (great article by IBM).
Finally, each file descriptor returned by, say, open() is relative to your program. Descriptors 0, 1 and 2 are normally already taken by stdin, stdout and stderr, so the first file you open will usually get file descriptor 3, and so on. On many GNU/Linux distributions you can find out which file descriptors are bound to each process under /proc/{pid_of_your_process}/fd.
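The per-process numbering described above can be observed from user space. The following is a minimal Python sketch, assuming Linux with /proc mounted; the function name open_and_inspect is made up for illustration, and os.open() is only a thin wrapper around the open() system call.

```python
import os
import tempfile


def open_and_inspect():
    """Open a scratch file and look at the descriptor the kernel handed back."""
    tmp_fd, path = tempfile.mkstemp()
    os.close(tmp_fd)
    real_path = os.path.realpath(path)

    fd = os.open(path, os.O_RDONLY)  # thin wrapper around open(2)
    # /proc/self/fd/<n> is a symbolic link to whatever descriptor n refers to.
    target = os.readlink(f"/proc/self/fd/{fd}")
    # Every entry in /proc/self/fd is a descriptor number owned by this process.
    open_fds = sorted(int(name) for name in os.listdir("/proc/self/fd"))

    # After close(), the number is free again; the next open() reuses the
    # lowest free number, which in a single-threaded program is the same one.
    os.close(fd)
    fd_again = os.open(path, os.O_RDONLY)
    os.close(fd_again)
    os.unlink(path)
    return fd, target, real_path, open_fds, fd_again


if __name__ == "__main__":
    fd, target, real_path, open_fds, fd_again = open_and_inspect()
    print("descriptor:", fd, "->", target)
    print("descriptors open in this process:", open_fds)
    print("number reused after close:", fd_again == fd)
```

The exact number printed depends on what the process already has open, which is why the sketch compares values instead of hard-coding 3.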

If you really want to dive deep, you can browse the source of many Unix variants. For Linux, check out http://lxr.linux.no/#linux+v3.9/fs/open.c and search for SYSCALL_DEFINE3(open, to get to the actual "open" syscall.

The kernel:
looks up the file (hard drive, USB, named pipes, standard streams, ...)
if everything went well, records internally that you opened the file
returns a descriptor to you
if you close() it or the process exits, releases its info about your open()
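The lifecycle listed above can be seen from the caller's side. This is a small Python sketch under the assumption of a POSIX system; descriptor_lifecycle is a hypothetical helper name, and the ".missing" suffix is just a path that should not exist.

```python
import errno
import os
import tempfile


def descriptor_lifecycle():
    """Walk through lookup, open, use and close of a descriptor."""
    tmp_fd, path = tempfile.mkstemp()
    os.close(tmp_fd)

    # Step 1 fails: the kernel cannot find the file, so no descriptor is made.
    try:
        os.open(path + ".missing", os.O_RDONLY)
        missing_errno = None
    except OSError as exc:
        missing_errno = exc.errno

    # Steps 2 and 3: the lookup succeeds and a descriptor is returned.
    fd = os.open(path, os.O_RDONLY)
    size = os.fstat(fd).st_size  # the descriptor is usable while it is open

    # Step 4: after close() the kernel no longer knows this number.
    os.close(fd)
    try:
        os.fstat(fd)
        closed_errno = None
    except OSError as exc:
        closed_errno = exc.errno

    os.unlink(path)
    return missing_errno, size, closed_errno


if __name__ == "__main__":
    missing_errno, size, closed_errno = descriptor_lifecycle()
    print(errno.errorcode[missing_errno], size, errno.errorcode[closed_errno])
```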

Related

Is it possible to open a file descriptor by reading /proc/<pid>/fd?

I am writing a shell application. The shell will fork off processes and start applications through execve.
I would like to reroute stdin in the new process to an unnamed pipe. This pipe would be shared between the application and the parent.
To my understanding, the same user and root can inspect the open file descriptors of the spawned processes. This can be done by accessing /proc/<pid>/fd/.
Is it possible, as root or as an application running as the same user, to open the given file descriptor and eavesdrop on it?
TLDR: I am interested in how safe unnamed pipes are, and whether it is possible to eavesdrop on them.
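One way to experiment with this question is to try it on your own process first. The sketch below assumes Linux with /proc mounted and uses a hypothetical helper name, reopen_pipe_via_proc. It only shows that a pipe end can be reopened by path through /proc/self/fd; access to another process's /proc/<pid>/fd entries is additionally subject to kernel permission checks that this sketch does not explore.

```python
import os


def reopen_pipe_via_proc(message=b"ping"):
    """Reopen the write end of an unnamed pipe through its /proc entry."""
    r, w = os.pipe()
    # Opening the /proc link yields a second, independent descriptor
    # that refers to the same underlying pipe.
    w2 = os.open(f"/proc/self/fd/{w}", os.O_WRONLY)
    os.write(w2, message)
    # Data written through the reopened end arrives at the original read end.
    data = os.read(r, len(message))
    for fd in (r, w, w2):
        os.close(fd)
    return data


if __name__ == "__main__":
    print(reopen_pipe_via_proc())
```

If the same open succeeds against another process's entry, that process's pipe is reachable in the same way, so an unnamed pipe is best treated as private only with respect to processes that cannot pass those permission checks.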

How are stdin and stdout made unique to the process?

Stdin and stdout are single files that are shared by multiple processes to take in input from the users. So how does the OS make sure that only the input given to a particular program is visible in the stdin for that program?
Your assumption that stdin/stdout (while having the same logical name) are shared among all processes is wrong at best.
stdin/stdout are logical names for open files that are forwarded (or initialized) by the process that started a given process. Actually, with the standard fork-and-exec pattern, they may be set up in the new process itself (after fork), before exec is called.
stdin/stdout are usually just inherited from the parent. So yes, there are groups of processes that share stdin and/or stdout for a given file node.
Also, since a file descriptor may be one end of a pipe, the well-known standard channels need not be linked to a file from a filesystem (or a device node). You should also include stderr in your considerations.
The normal way of setup is:
the parent (e.g. your shell) calls fork
the forked process (child) sets up the environment, standard I/O channels and anything else
the child then calls exec to overlay the process with the target image to be executed
During setup, the child either keeps the existing channels or replaces them with new ones, e.g. by creating a pipe and linking the endpoints appropriately. (To be honest, in this simplified description the pipe needs to be created before the fork.)
This way, most processes have their own I/O channels.
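The fork, set up, exec sequence above can be sketched in Python, whose os module wraps the same system calls. This is an illustration under assumptions, not how any particular shell does it: the helper name run_child_with_pipes and the tiny child program are invented for the example, and the input is kept small so the parent cannot block on a full pipe.

```python
import os
import sys

# Hypothetical child program: reads stdin, writes an upper-cased copy to stdout.
CHILD_CODE = "import sys; sys.stdout.write(sys.stdin.read().upper())"


def run_child_with_pipes(data):
    """Give a child process private stdin/stdout pipes, then exec it."""
    in_r, in_w = os.pipe()    # parent writes, child reads as stdin
    out_r, out_w = os.pipe()  # child writes as stdout, parent reads
    pid = os.fork()
    if pid == 0:
        # Child: rewire the standard channels before exec.
        try:
            os.dup2(in_r, 0)
            os.dup2(out_w, 1)
            for fd in (in_r, in_w, out_r, out_w):
                if fd > 2:
                    os.close(fd)
            os.execv(sys.executable, [sys.executable, "-c", CHILD_CODE])
        finally:
            os._exit(127)  # only reached if exec failed
    # Parent: keep only its own ends of the two pipes.
    os.close(in_r)
    os.close(out_w)
    os.write(in_w, data)
    os.close(in_w)  # the child sees end-of-file on its stdin
    chunks = []
    while True:
        chunk = os.read(out_r, 4096)
        if not chunk:
            break
        chunks.append(chunk)
    os.close(out_r)
    os.waitpid(pid, 0)
    return b"".join(chunks)


if __name__ == "__main__":
    print(run_child_with_pipes(b"hello\n"))
```

Note that both pipes are created before the fork, matching the remark above; the child then only has to move the inherited ends onto descriptors 0 and 1.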
Nevertheless, multiple processes may write into a channel they are connected to (i.e. have a valid file descriptor for). When reading, each chunk of data (usually lines with terminals or blocks with files) is read by a single reader only. So if you have several (running) processes reading from a terminal as stdin, only one will read your typing, while the others will not see it at all.
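The "single reader only" behaviour can be shown without a terminal. In this sketch, two descriptors in one process stand in for two processes that inherited the same pipe; that substitution is an assumption made to keep the outcome deterministic, and single_consumer_demo is an illustrative name.

```python
import os


def single_consumer_demo():
    """Show that data read through one descriptor is gone for the other."""
    r, w = os.pipe()
    r2 = os.dup(r)  # second descriptor for the same channel
    os.write(w, b"line 1\n")

    first = os.read(r, 100)  # this reader consumes the chunk

    # Switch to non-blocking mode so an empty pipe raises instead of hanging.
    # The flag lives on the shared open file, so it applies to r as well.
    os.set_blocking(r2, False)
    try:
        second = os.read(r2, 100)
    except BlockingIOError:
        second = None  # nothing left: the chunk was delivered only once

    for fd in (r, r2, w):
        os.close(fd)
    return first, second


if __name__ == "__main__":
    print(single_consumer_demo())
```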

MPI one-sided file I/O

I have some questions about performing file I/O using MPI.
A set of files are distributed across different processes.
I want the processes to read the files in the other processes.
For example, in one-sided communication, each process sets up a window visible to other processes. I need exactly the same functionality. (Create 'windows' for all files and share them so that any process can read any file from any offset.)
Is this possible in MPI? I have read a lot of documentation about MPI, but couldn't find exactly this.
The simple answer is that you can't do that automatically with MPI.
You can convince yourself by noting that MPI_File_open() is a collective call taking an intra-communicator as its first argument and returning a file handle to the opened file as its last argument. All processes in this communicator open the file, and therefore all of them must be able to see it. So unless a process sees a file, it cannot get an MPI_File handle to access it.
Now, that doesn't mean there's no solution. A possibility could be to do by hand exactly what you described, namely:
Each MPI process individually opens the file it sees and is responsible for; then
Each of these processes reads this local file into a buffer;
These individual buffers are all exposed, using either a global MPI_Win memory window or several individual ones, ready for one-sided read accesses; and finally
All read accesses to any data previously stored in these individual local files are now done through MPI_Get() calls using the memory window(s).
The real limitation of this approach is that it requires reading all of the individual files in full, so you need sufficient memory per node to store each of them. I'm well aware that this is a very big caveat that could make the solution completely impractical. However, if the memory is sufficient, this is an easy approach.
Another, even simpler solution would be to store the files on a shared file system, or to have them all copied to all local file systems. I imagine this isn't an option, since the question wouldn't have been asked otherwise...
Finally, as a last resort, one possibility I see would be to dedicate an MPI process (or an OpenMP thread of an MPI process) per node to serve each file. This process would just act as a "file server", answering "read" requests coming from the other MPI processes and serving them by reading the requested data from the file and sending it back via MPI. It's a bit lengthy to write, but it should work.

Read/write data using mmap for encrypted file system

I am working on an encrypted filesystem that encrypts data just before writing it to disk and decrypts it right after reading it from disk. Any file on disk is useless unless it is decrypted first. So far I have changed the standard read and write methods that the filesystem provides.
The problem begins with mmap, used for memory-mapping files. For example, in the ext4 filesystem, as far as I know, mmap does not go through standard I/O, so mapped data should be encrypted/decrypted just as it is for the read/write system calls. So how can I decrypt data when it is being read from disk, and encrypt it when the kernel wants to update memory-mapped files?
I want to stay within my filesystem-specific module if possible.
UPDATE: read/write works perfectly in the terminal, but:
I cannot execute binary files on the encrypted partition.
when I copy files using a GUI-based file manager (pcmanfm, for example), the resulting file is corrupted.
So should I modify any other system calls like I did with read/write?
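To see why hooking only read and write is not enough, it helps to watch a program change a file without ever calling write(). The following user-space Python sketch assumes a POSIX system; write_through_mapping is an illustrative name and nothing here is kernel code.

```python
import mmap
import os
import tempfile


def write_through_mapping():
    """Modify a file purely through a shared memory mapping."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"plain text")    # initial content via write()
        with mmap.mmap(fd, 0) as m:    # length 0 maps the whole file
            m[:5] = b"PLAIN"           # a plain memory store, no write() call
            m.flush()                  # ask the kernel to write the pages back
        os.lseek(fd, 0, os.SEEK_SET)
        return os.read(fd, 100)        # the change is visible through read()
    finally:
        os.close(fd)
        os.unlink(path)


if __name__ == "__main__":
    print(write_through_mapping())
```

On Linux, mapped pages are filled and written back through the filesystem's page-cache callbacks rather than its read/write file operations, so that layer is presumably where the encryption would also have to hook in. Treat this as a pointer for investigation, not a verified fix for the module in question.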

0-copy inter-process communication on Unix without using the filesystem

If I have to move a moderate amount of memory between two processes, I can do the following:
create a file for writing
ftruncate to desired size
mmap and unlink it
use as desired
When another process requires that data, it:
connects to the first process through a unix socket
the first process sends the fd of the file through a unix socket message
mmaps the fd
use as desired
This allows us to move memory between processes without any copy - but the file created must be on a memory-mounted filesystem, otherwise we might get a disk hit, which would degrade performance. Is there a way to do something like that without using a filesystem? A malloc-like function that returned a fd along with a pointer would do it.
[Edit] Having a file descriptor also provides a reference-counting mechanism that is maintained by the kernel.
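The procedure described in the question can be sketched in Python as follows. It assumes Linux or another Unix with Python 3.9 or later (for socket.send_fds and socket.recv_fds); the helper name share_region is invented, and fork plus socketpair are used only as a convenient way to get a second process and a connected Unix socket. An unrelated process that connected to a listening Unix socket would receive the descriptor the same way.

```python
import mmap
import os
import socket
import tempfile


def share_region(size=4096):
    """Share an unlinked, memory-mapped file with a child via fd passing."""
    fd, path = tempfile.mkstemp()   # create a file for writing
    os.ftruncate(fd, size)          # ftruncate to the desired size
    region = mmap.mmap(fd, size)    # shared mapping (the default on Unix)
    os.unlink(path)                 # the file now lives only through its fd
    region[:5] = b"hello"

    parent_sock, child_sock = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    pid = os.fork()
    if pid == 0:
        status = 1
        try:
            parent_sock.close()
            # Receive the descriptor as ancillary data and map it.
            _msg, fds, _flags, _addr = socket.recv_fds(child_sock, 16, 1)
            view = mmap.mmap(fds[0], size)
            view[:5] = view[:5].upper()  # modify the shared pages in place
            view.close()
            status = 0
        finally:
            os._exit(status)

    child_sock.close()
    socket.send_fds(parent_sock, [b"fd"], [fd])  # pass the fd over the socket
    os.waitpid(pid, 0)
    result = bytes(region[:5])  # the child's change shows up without a copy
    region.close()
    os.close(fd)
    parent_sock.close()
    return result


if __name__ == "__main__":
    print(share_region())
```

Whether the backing file ever touches a disk still depends on where the temporary directory lives, which is the concern raised in the question; the sketch does not address that part.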
Is there anything wrong with System V or POSIX shared memory (which are somewhat different, but end up with the same result)? With any such system, you have to worry about coordination between the processes as they access the memory, but that is true with memory-mapped files too.
