I am confused about Unix system file tables.
When two or more processes open a file for reading, does the system file table create separate entries for each process or a single entry?
If a single entry is created for multiple processes opening the same file, will their file offsets also be the same?
If process 1 opens file1.txt for reading and process 2 opens the same file file1.txt for writing, will the system file table create one or two entries?
There are actually three tables involved: each process has a file descriptor table that maps file descriptors (small integers) to entries in the system-wide open file table. Each entry in the open file table contains (among other things) a file offset and a pointer to an entry in the in-memory inode table. Here's a picture:
[Diagram of the per-process file descriptor tables, the system-wide open file table, and the in-memory inode table. Source: rich from www.cs.ucsb.edu, now on archive.org.]
So there is neither just one open file table entry per open file, nor just one per process: there is one per open() call, and that entry is shared if the file descriptor is dup()ed or inherited across a fork().
Answering your questions:
When two or more processes open a file for reading, there's an entry in the open file table per open. There is even an entry per open if one process opens the file multiple times.
A single entry is not created in the open file table for different processes opening the same file (but there is just one entry in the in-memory inode table).
If file1.txt is opened twice, in the same or two different processes, there are two different open file table entries (but just one entry in the in-memory inode table).
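To make the offset behaviour concrete, here is a minimal sketch in C. It assumes a file1.txt of at least 8 bytes exists in the current directory, and error handling is kept to a bare minimum: two independent open() calls get two open file table entries (and therefore two independent offsets), while a dup()ed descriptor shares the original entry and its offset.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char buf[4];

    /* Two separate open() calls: two open file table entries, two offsets. */
    int fd1 = open("file1.txt", O_RDONLY);
    int fd2 = open("file1.txt", O_RDONLY);
    if (fd1 == -1 || fd2 == -1)
        return 1;

    read(fd1, buf, 4);   /* advances fd1's offset only */
    printf("fd1 offset: %ld\n", (long)lseek(fd1, 0, SEEK_CUR));  /* 4 */
    printf("fd2 offset: %ld\n", (long)lseek(fd2, 0, SEEK_CUR));  /* still 0 */

    /* dup(): a new descriptor, but the SAME open file table entry and offset. */
    int fd3 = dup(fd1);
    read(fd3, buf, 4);   /* advances the offset shared by fd1 and fd3 */
    printf("fd1 offset after dup()+read(): %ld\n", (long)lseek(fd1, 0, SEEK_CUR));  /* 8 */

    close(fd1);
    close(fd2);
    close(fd3);
    return 0;
}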
The same file may be opened simultaneously by several processes, and even by the same process (resulting in several file descriptors for the same file), depending on the file organization and filesystem. Operations on the descriptors, like moving the file pointer or closing it, are independent (they do not affect other descriptors for the same file). Operations on the file (like a write) can be seen via the other descriptors (a subsequent read can read the written data).
This is from the open (system call) wiki page.
Related
I am creating a simple NiFi pipeline to read a file and write the same file to two different locations. Below is the flow of my pipeline:
1) Read the file from server_1 directory_1
2) copy the file to server_1 directory_2
3) copy the file to server_2 directory_3
A Python script is continuously generating CSV files in server_1 directory_1. I am able to do the first and second steps, but in the third step the pipeline writes only old data; to read new data I need to empty the success_sftp queue. Below is a screenshot of the pipeline.
In the third step it shows two behaviours:
1) If no CSV file is present in the input directory and I run the flow, it copies only the first file that arrives (not the files after that), and the success_sftp queue is full after that.
2) If I have CSV files (say 10 files) in the input directory and I run the flow, it copies all 10 CSV files to the output directory, and after that the queue is full. To write more files I need to empty the queue.
Kindly assist
I have two files that come in daily to a shared drive. When they are posted, they come in with the current date as part of the file name, for example dataset1_12517.txt and dataset2_12517.txt; the next day they post as dataset1_12617.txt, and so on. They are pipe-delimited files, if that matters.
I am trying to automate a daily merge of these two files into a single Excel file that is overwritten with each merge (the file name remains the same), so my Tableau dashboard can read the output without having to make a new connection daily. The tricky part is that the file names change daily, but they follow a specific naming convention.
I have access to RStudio. I have not started writing code yet, so I am looking for a place to start or a better solution.
On a Windows machine, use the copy or xcopy command-line tools. There are several variations on how to do it. The gist of it, though, is that if you supply the right switches, the source file will be appended to the destination file.
I like using xcopy for this. Supply the source file name (or a list of source files) and then the destination file name.
This becomes a batch file and you can run it as a scheduled task or on demand.
This is roughly what it would look like. You may need to check the online docs to choose the right parameter switches.
xcopy C:\SRC\source_2010235.txt newfile.txt /s
As you play with it, you may even try using a wildcard approach.
xcopy C:\SRC\*.txt newfile.txt /s
See Getting XCOPY to concatenate (append) for more details.
What's the real difference between these two commands? Why is the system call to delete a file called unlink instead of delete?
You need to understand a bit about the original Unix file system to understand this very important question.
Unlike other operating systems of its era (late 60s, early 70s), Unix did not store the file name together with the actual directory information (where the file was stored on the disks). Instead, Unix created a separate "inode table" to contain that information and identify the actual file, and then allowed separate files to act as directories of names and inodes. Originally, directory files were meant to be manipulated like all other files, as straight text files, using the same tools (cat, cut, sed, etc.) that shell programmers are familiar with to this day.
One important consequence of this architectural decision was that a single file could have more than one name! Each occurrence of an inode number in a particular directory file was essentially a link to that inode, and that is what it was called. To connect a file name to the file's inode (the "actual" file), you "linked" it, and when you deleted the name from a directory you "unlinked" it.
Of course, unlinking a file name did not automatically mean that you were deleting / removing the file from the disk, because the file might still be known by other names in other directories. The Inode table also includes a link count to keep track of how many names an inode (a file) was known by; linking a name to a file adds one to the link count, and unlinking it removes one. When the link count drops down to zero, then the file is no longer referred to in any directory, presumed to be "unwanted," and only then can it be deleted.
For this reason the "deletion" of a file by name unlinks it - hence the name of the system call - and there is also the very important ln command to create an additional link to a file (really, to the file's inode) and let it be known by another name.
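Here is a minimal sketch in C of that link-count bookkeeping (the names demo.txt and demo-alias.txt are just illustrative, and neither is assumed to exist beforehand): link() is what the ln command uses under the hood, and the stat() calls show st_nlink going up and down.

#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    struct stat st;

    /* Create a file with one name; its inode starts with a link count of 1. */
    FILE *f = fopen("demo.txt", "w");
    if (f == NULL)
        return 1;
    fputs("hello\n", f);
    fclose(f);

    /* link(): add a second name for the same inode (what ln does). */
    if (link("demo.txt", "demo-alias.txt") == -1)
        return 1;

    stat("demo.txt", &st);
    printf("link count after link():   %ld\n", (long)st.st_nlink);   /* 2 */

    /* unlink() removes one name; the inode and its data survive because
       another name still refers to it. */
    unlink("demo.txt");

    stat("demo-alias.txt", &st);
    printf("link count after unlink(): %ld\n", (long)st.st_nlink);   /* 1 */

    /* Removing the last name drops the count to zero, and only then
       is the file actually deleted. */
    unlink("demo-alias.txt");
    return 0;
}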
Other, newer operating systems and their file systems have to emulate / respect this behavior in order to comply with the Posix standard.
If I am reading a file stored on an NTFS filesystem, and I try to move/rename that file while it is still being read, I am prevented from doing so. If I try this on a UNIX filesystem such as EXT3, it succeeds, and the process doing the reading is unaffected. I can even rm the file and reading processes are unaffected. How does this work? Could somebody explain to me why this behaviour is supported under UNIX filesystems but not NTFS? I have a vague feeling it has to do with hard links and inodes, but I would appreciate a good explanation.
Unix filesystems use reference counting and a two-layer architecture for finding files.
The filename refers to something called an inode, for information node or index node. The inode stores (a pointer to) the file contents as well as some metadata, such as the file's type (ordinary, directory, device, etc.) and who owns it.
Multiple filenames can refer to the same inode; they are then called hard links. In addition, a file descriptor (fd) refers to an inode. An fd is the type of object a process gets when it opens a file.
A file in a Unix filesystem only disappears when the last reference to it is gone, so when there are no more names (hard links) or fd's referencing it. So, rm does not actually remove a file; it removes a reference to a file.
This filesystem setup may seem confusing and it sometimes poses problems (esp. with NFS), but it has the benefit that locking is not necessary for a lot of applications. Many Unix programs also use the situation to their advantage by opening a temporary file and deleting it immediately after. As soon as they terminate, even if they crash, the temporary file is gone.
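Here is a minimal sketch of that temporary-file trick in C (the name scratch.tmp is just illustrative and is assumed not to exist yet): once the name is unlinked, the open descriptor is the only remaining reference, and the data disappears when it is closed or the process exits.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Create and open a scratch file, then immediately remove its name. */
    int fd = open("scratch.tmp", O_RDWR | O_CREAT | O_EXCL, 0600);
    if (fd == -1)
        return 1;
    unlink("scratch.tmp");           /* no directory entry any more */

    /* The descriptor still references the inode, so I/O keeps working. */
    write(fd, "still here\n", 11);
    lseek(fd, 0, SEEK_SET);

    char buf[32];
    ssize_t n = read(fd, buf, sizeof buf - 1);
    if (n > 0) {
        buf[n] = '\0';
        fputs(buf, stdout);          /* prints "still here" */
    }

    close(fd);                       /* link count is 0, so the data is freed now */
    return 0;
}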
On Unix, a filename is simply a link to the actual file (inode). Opening a file also creates a (temporary) reference to the actual file. When all links to a file have disappeared (rm and close()), the file is removed.
On NTFS, logically the filename is the file. There is no indirection layer from the filename to the file metainfo; they are the same object. If you open it, it's in use and can't be removed, just as the actual file (inode) on Unix can't be removed while it's in use.
Unix: Filename ➜ FileInfo ➜ File Data
NTFS: FileName + FileInfo ➜ File Data
Suppose that a request is made to ls somefile. How does the UNIX file system handle this request from an algorithmic perspective? Is it an O(1) query, an O(log(N)) lookup (in the number of files, starting from the current directory node, say), an O(N) linear search, or some combination depending on certain parameters?
It can be O(n). Classic Unix file systems, based on the old-school BSD Fast File System and the like, store files by inode number, and their names are assigned at the directory level, not at the file level. This allows you to have the same file present in multiple locations at the same time, via hard links. As such, a "directory" in most Unix systems is just a file that lists filenames and inode numbers for all the files stored "in" that directory.
Searching for a particular filename in a directory just means opening that directory file and parsing through it until you find the filename's entry.
Of course, there are many different file systems available for Unix systems these days, and some will have completely different internal semantics for finding files, so there's no one "right" answer.
It's O(n), since the file system has to read the directory off physical media initially, but buffer caches will speed that up significantly, depending on the Virtual File System (VFS) implementation on your flavor of *nix. (Notice how the first time you access a file it's slower than the second time you execute the exact same command?)
To learn more, read IBM's article on the anatomy of the Unix file system.
Typical flow for a program like ls would be
Opendir on the current path.
Readdir for the current path.
Filter the entries returned by readdir through the filter provided on the command line. So it is typically O(n); see the sketch below.
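Here is a minimal sketch of that flow in C (the target name somefile and the path "." are just illustrative): the walk over the directory entries is linear, which is where the O(n) comes from.

#include <dirent.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *wanted = "somefile";           /* name given on the command line */

    DIR *dir = opendir(".");                   /* 1) opendir on the current path */
    if (dir == NULL)
        return 1;

    struct dirent *entry;
    while ((entry = readdir(dir)) != NULL) {       /* 2) readdir each entry in turn */
        if (strcmp(entry->d_name, wanted) == 0) {  /* 3) filter by name */
            printf("%s\n", entry->d_name);
            break;
        }
    }

    closedir(dir);
    return 0;
}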
This is the generic flow; however, there are many optimizations in place for special and frequent cases (like caching the inode numbers of recent and frequently used paths).
Also, it depends on how directory files are organized. In Unix, entries are stored in order of creation, forcing a read of every entry and pushing the look-up time to O(n). In NTFS, the equivalent of a directory file is sorted by name.
I can't answer your question. Maybe if you take a peek at the source code, you can answer it yourself and explain to us how it works.
ls.c
ls.h