Difference between unlink and rm on Unix

What's the real difference between these two commands? Why is the system call to delete a file called unlink instead of delete?

To answer this very important question, you need to understand a bit about the original Unix file system.
Unlike other operating systems of its era (the late 60s and early 70s), Unix did not store the file name together with the rest of the information about the file (where its data actually lived on the disks). Instead, Unix created a separate "inode table" to hold that information and identify the actual file, and then allowed separate text files to serve as directories: lists of names and inode numbers. Originally, directory files were meant to be manipulated like all other files, as straight text, using the same tools (cat, cut, sed, etc.) that shell programmers are familiar with to this day.
One important consequence of this architectural decision was that a single file could have more than one name! Each occurrence of an inode number in a particular directory file was essentially a link to that inode, and that is the terminology Unix adopted: to connect a file name to the file's inode (the "actual" file), you "link" it, and when you delete the name from a directory you "unlink" it.
Of course, unlinking a file name did not automatically mean that you were deleting / removing the file from the disk, because the file might still be known by other names in other directories. The inode table also includes a link count to keep track of how many names an inode (a file) is known by; linking a name to a file adds one to the link count, and unlinking it subtracts one. When the link count drops to zero, the file is no longer referred to in any directory, is presumed to be "unwanted," and only then can it be deleted.
For this reason, the "deletion" of a file by name unlinks it - hence the name of the system call - and there is also the very important ln command to create an additional link to a file (really, to the file's inode) and let it be known by another name.
Other, newer operating systems and their file systems have to emulate or respect this behavior in order to comply with the POSIX standard.
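A quick way to watch that bookkeeping happen is through Python's os module, which wraps the link() and unlink() system calls directly; this is just a sketch, and the file names below are placeholders:

import os
import tempfile

workdir = tempfile.mkdtemp()                  # scratch directory so the demo is self-contained
original = os.path.join(workdir, "data.txt")
alias = os.path.join(workdir, "alias.txt")

with open(original, "w") as f:
    f.write("hello\n")

print(os.stat(original).st_nlink)   # 1: one name refers to the inode

os.link(original, alias)            # same idea as: ln data.txt alias.txt
print(os.stat(original).st_nlink)   # 2: two names, one inode

os.unlink(original)                 # same idea as: rm data.txt
print(os.stat(alias).st_nlink)      # 1: the data is still reachable under the other name
print(open(alias).read(), end="")   # "hello" -- nothing has been deleted yet

os.unlink(alias)                    # the link count drops to 0 and the file is gone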

Related

Checking the hash of files copied by SaltStack

I'm interested in how SaltStack checks, when it copies files, whether a file already exists and whether it is corrupted.
Does it check the hash sum of a file's content, or only file names? Especially when the file.recurse state is used for copying directories.
Does anybody know how to iterate through the directory contents and check the hash sums of the files?
I understand that these are several questions, but it is important to see the big picture of the copying process.
Salt compares the hash sums of the files to decide if a download is needed.
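Beyond that, I can't speak to Salt's internals, but the general idea of "walk the tree and compare content hashes to decide whether a copy is needed" looks roughly like the Python sketch below; the paths, hash choice, and helper names are mine, not Salt's:

import hashlib
import os

def file_hash(path, algo="sha256"):
    # Hash the file's content in chunks so large files don't have to fit in memory.
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def needs_copy(src, dst):
    # A file needs copying if the destination is missing or its content hash differs.
    if not os.path.exists(dst):
        return True
    return file_hash(src) != file_hash(dst)

def plan_recurse(src_root, dst_root):
    # Walk the source tree and report which files would have to be (re)copied.
    for dirpath, _dirnames, filenames in os.walk(src_root):
        for name in filenames:
            src = os.path.join(dirpath, name)
            dst = os.path.join(dst_root, os.path.relpath(src, src_root))
            if needs_copy(src, dst):
                print("copy needed:", src, "->", dst)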

Unix command to remove a file so that it can't be retrieved at any cost

The rm command removes only the reference, not the actual data, from the disk, so the data can be retrieved later. Is there any command that deletes both the reference and the data at the same time?
It really depends on what you need.
If you need to reclaim the storage space without waiting for all the processes that hold the file open to close it or die, invoke truncate -s 0 FILENAME to free the data, and then remove the file with a regular rm FILENAME. The two operations will not be atomic, though, and programs that have already opened the file can fail in various ways (including a segmentation fault, if they have mapped some portions of the file into memory). However, given that you intend to delete both the file and its contents, there is no general way to prevent programs that depend on the contents from failing.
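In terms of the underlying calls, that two-step sequence is just the following minimal Python sketch (FILENAME stands for whatever file you mean, as above); note again that nothing makes the two steps atomic:

import os

path = "FILENAME"      # the file whose space you want to reclaim before deleting it

os.truncate(path, 0)   # step 1: free the data blocks (the truncate -s 0 FILENAME above)
# Anything still holding the file open now sees an empty file and may misbehave.
os.remove(path)        # step 2: remove the name itself (the rm FILENAME above)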
If your goal is for the data to not be retrievable with forensic analysis after removal, use a command such as shred, which is specifically designed to overwrite the data before removing the file. Pay close attention to the limitations of such tools before trusting them to reliably destroy sensitive data.
If, as your comment suggests, you are on OSX, you can use srm to do "secure removals".
SRM(1) SRM(1)
NAME
srm - securely remove files or directories
SYNOPSIS
srm [OPTION]... FILE...
DESCRIPTION
srm removes each specified file by overwriting, renaming, and truncating it before unlinking. This prevents other people from undeleting or recovering any information about the file from the command line.
The online man page is here.
Alternatively, shred is available in GNU coreutils, which you can easily install on OS X using Homebrew, with the command
brew install coreutils
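Purely as an illustration of what the overwrite-before-unlink approach amounts to, here is a minimal Python sketch of the idea behind shred and srm. It is not a substitute for those tools: on journaling or copy-on-write file systems and on SSDs, overwriting in place does not guarantee the old blocks are actually destroyed, which is exactly the limitation warned about above.

import os

def overwrite_and_unlink(path, passes=1):
    # Illustration only: overwrite the file's content in place, then unlink it.
    size = os.path.getsize(path)
    with open(path, "r+b") as f:
        for _ in range(passes):
            f.seek(0)
            remaining = size
            while remaining > 0:
                chunk = min(remaining, 1024 * 1024)
                f.write(os.urandom(chunk))   # random bytes over the old data
                remaining -= chunk
            f.flush()
            os.fsync(f.fileno())             # push the overwrite out to the device
    os.unlink(path)                          # only then remove the name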

Executing 'mv A B': Will the 'inode' be changed?

If we execute a command:
mv A B
then what will happen to the fields in the inode of file A? Will it change?
I don't think that it should change just by changing the name of the file, but I'm not sure.
It depends at least partially on what A and B are. If you're moving between file systems, the inode will almost certainly be different.
Renaming the file within the same file system is more likely to keep the same inode, because the inode belongs to the data rather than to the directory entry, and efficiency favors that design. However, it depends on the file system and is in no way mandated by standards.
For example, there may be a versioning file system with the inode concept that gives you a new inode because it wants to track the name change.
It depends.
There is a nice example on this site which shows that the inode may stay the same. But I would not rely on this behaviour; I doubt that it is specified in any standard.
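You can check it yourself by comparing st_ino before and after the rename; on a typical local file system both numbers come out equal. This is only a sketch, and the names A and B below just mirror the ones in the question:

import os

with open("A", "w") as f:      # create a throwaway file named A
    f.write("test\n")

before = os.stat("A").st_ino   # inode number under the old name
os.rename("A", "B")            # what mv A B does within one file system
after = os.stat("B").st_ino    # inode number under the new name

print(before, after, before == after)   # typically True on the same file system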

How do the UNIX commands mv and rm work with open files?

If I am reading a file stored on an NTFS filesystem, and I try to move/rename that file while it is still being read, I am prevented from doing so. If I try this on a UNIX filesystem such as EXT3, it succeeds, and the process doing the reading is unaffected. I can even rm the file and reading processes are unaffected. How does this work? Could somebody explain to me why this behaviour is supported under UNIX filesystems but not NTFS? I have a vague feeling it has to do with hard links and inodes, but I would appreciate a good explanation.
Unix filesystems use reference counting and a two-layer architecture for finding files.
The filename refers to something called an inode, for information node or index node. The inode stores (a pointer to) the file contents as well as some metadata, such as the file's type (ordinary, directory, device, etc.) and who owns it.
Multiple filenames can refer to the same inode; they are then called hard links. In addition, a file descriptor (fd) refers to an inode. An fd is the type of object a process gets when it opens a file.
A file in a Unix filesystem only disappears when the last reference to it is gone, so when there are no more names (hard links) or fd's referencing it. So, rm does not actually remove a file; it removes a reference to a file.
This filesystem setup may seem confusing and it sometimes poses problems (esp. with NFS), but it has the benefit that locking is not necessary for a lot of applications. Many Unix programs also use the situation to their advantage by opening a temporary file and deleting it immediately after. As soon as they terminate, even if they crash, the temporary file is gone.
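That temporary-file trick is easy to demonstrate in a short Python sketch; the open file descriptor keeps the inode alive even though its only name is already gone (the file name below is just a placeholder):

import os

f = open("scratch.tmp", "w+")   # open (and create) a temporary file
os.unlink("scratch.tmp")        # remove its only name; the link count is now 0

# The open file descriptor still references the inode, so the file is fully usable:
f.write("still works\n")
f.seek(0)
print(f.read())

f.close()                       # the last reference goes away, and the data is reclaimed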
On Unix, a filename is simply a link to the actual file (inode). Opening a file also creates a (temporary) link to the actual file. When all links to a file have disappeared (rm and close()), the file is removed.
On NTFS, logically the filename is the file. There's no indirection layer from the filename to the file's metainfo; they're the same object. If you open it, it's in use and can't be removed, just as the actual file (inode) on Unix can't be removed while it's in use.
Unix: Filename ➜ FileInfo ➜ File Data
NTFS: FileName + FileInfo ➜ File Data

How does file system on UNIX find files?

Suppose that a request is made to ls somefile. How does the file system in UNIX handle this request from an algorithmic perspective? Is it an O(1) query, or O(log(N)) in the number of files (say, starting from the current directory node), or is it an O(N) linear search, or some combination depending on certain parameters?
It can be O(n). Classic Unix file systems, based on the old-school BSD Fast File System and the like, identify files by inode number, and names are assigned at the directory level, not at the file level. This allows you to have the same file present in multiple locations at the same time, via hard links. As such, a "directory" in most Unix systems is just a file that lists filenames and inode numbers for all the files stored "in" that directory.
Searching for a particular filename in a directory just means opening that directory file and parsing through it until you find the filename's entry.
Of course, there are many different file systems available for Unix systems these days, and some will have completely different internal semantics for finding files, so there's no one "right" answer.
It's O(n), since the file system has to read the directory off physical media initially, but buffer caches will improve that significantly, depending on the Virtual File System (VFS) implementation on your flavor of *nix. (Notice how the first time you access a file it is slower than the second time you execute the exact same command?)
To learn more read IBM's article on the Anatomy of the Unix file system.
A typical flow for a program like ls would be:
Opendir on the current path.
Readdir for the current path.
Filter the entries returned by readdir through the filter provided on the command line.
So it is typically O(n).
This is the generic flow; however, there are many optimizations in place for special and frequent cases (like caching the inode numbers of recent and frequently used paths).
It also depends on how the directory file is organized. In classic Unix file systems it is ordered by time of creation, forcing a read of every entry and pushing the lookup time to O(n). In NTFS, the equivalent of a directory file is sorted by name.
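A rough Python rendering of that opendir/readdir/filter flow, just to make the linear scan concrete (the pattern handling is simplified compared to what ls and the shell actually do):

import fnmatch
import os

def list_matching(path, pattern="*"):
    # Scan every entry in the directory and keep the matching names: O(n) in the entry count.
    matches = []
    with os.scandir(path) as entries:   # roughly opendir()
        for entry in entries:           # roughly readdir(), one entry at a time
            if fnmatch.fnmatch(entry.name, pattern):
                matches.append(entry.name)
    return sorted(matches)

print(list_matching(".", "*.c"))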
I can't answer your question. Maybe if you take a peek into the source code, you can answer it yourself and explain to us how it works.
ls.c
ls.h
