I'm interested in how SaltStack checks
whether a file already exists
whether a file is corrupted
when it copies files.
Does it check the hash sum of a file and its contents, or only file names? Especially when the file.recurse state is used for copying directories.
Does anybody know how to iterate through a directory's contents and check the hash sums of the files?
I understand that this is really several questions in one, but it is important to see the big picture of the copying process.
Salt compares the hash sums of the files to decide if a download is needed.
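If you want to do the same sort of check yourself, outside of Salt, a minimal Python sketch along these lines walks a directory tree and hashes every file; comparing the resulting dictionaries for source and destination tells you which files are missing or differ (the two paths at the bottom are just placeholders):

    import hashlib
    import os

    def hash_file(path, algorithm="sha256", chunk_size=65536):
        """Return the hex digest of one file, read in chunks to cope with large files."""
        digest = hashlib.new(algorithm)
        with open(path, "rb") as handle:
            for chunk in iter(lambda: handle.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def hash_tree(root):
        """Walk a directory tree and map each relative file path to its hash."""
        hashes = {}
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                full_path = os.path.join(dirpath, name)
                hashes[os.path.relpath(full_path, root)] = hash_file(full_path)
        return hashes

    if __name__ == "__main__":
        source = hash_tree("/srv/salt/files")   # placeholder source tree
        destination = hash_tree("/etc/myapp")   # placeholder destination tree
        for relative, digest in source.items():
            if destination.get(relative) != digest:
                print("missing or different:", relative)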
According to these questions:
Automatically Delete Files/Folders
how to delete a file with R?
the two ways to delete files in R are file.remove and unlink. These are both permanent and non-recoverable.
Is there an alternative method to delete files so they end up in the trash / recycle bin?
I don't know of a solution that is fully compatible with Windows' "recycle bin", but if you're looking for something that doesn't quite delete files, yet prevents them from being stored indefinitely, a possible solution would be to move the files to the temporary folder for the current session.
The command tempdir() gives the location of that temporary folder, and you can simply move files there with file.rename().
They will remain available for as long as the current session is running and will automatically be deleted afterwards. This is less persistent than the classic recycle bin; if that persistence is what you're after, you probably just want to move files to a dedicated folder and delete it completely when you're done.
For a slightly more consistent syntax, you can use the fs package (https://github.com/r-lib/fs), and its fs::path_temp() and fs::file_move().
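The same idea is easy to express outside R as well; here is a rough Python sketch of the "move to a session-scoped temporary folder instead of deleting" pattern, purely as an illustration (the file name is made up, and unlike R's tempdir() the folder below is not cleaned up automatically, so empty it yourself when you're done):

    import os
    import shutil
    import tempfile

    # One throwaway folder per run; unlike R's tempdir(), nothing removes it
    # automatically, so treat it as a manually emptied "soft trash".
    SESSION_TRASH = tempfile.mkdtemp(prefix="soft-delete-")

    def soft_delete(path):
        """Move a file into the session's temporary folder instead of deleting it."""
        target = os.path.join(SESSION_TRASH, os.path.basename(path))
        shutil.move(path, target)
        return target

    # Hypothetical usage:
    # parked_at = soft_delete("results.csv")
    # print("parked at", parked_at)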
What's the best practice for storing a large (expanding) number of small files on a server without running into inode limitations?
For a project, I am storing a large number of small files on a server with 2TB of disk space, but my limitation is the 2560000 allowed inodes. Recently the server used up all of its inodes and was unable to write new files. I subsequently moved some files into databases, but others (images and JSON files) remain on the drive. I am currently at 58% inode usage, so a solution is needed soon.
The reason for storing the files individually is to limit the number of database calls. Basically the scripts check whether a file exists and, if so, return results accordingly. Performance-wise this makes sense for my application, but as stated above it has limitations.
As I understand it, moving the files into sub-directories does not help, because each file (and each directory) still consumes an inode, so in fact I would just use up more inodes.
Alternatively I might be able to bundle the files together into an archive-type file, but that would require some sort of indexing (see the sketch below).
Perhaps I am going about this all wrong, so any feedback is greatly appreciated.
On the advice of arkascha I looked into loop devices and found some documentation about losetup. This remains to be tested.
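Since part of the data already lives in databases, one low-effort version of the "archive with an index" idea is to pack the remaining small files into a single SQLite file: it costs one inode no matter how many entries it holds, and the "does this file exist" check becomes a single indexed lookup. A rough Python sketch (the store location and the example path are made up):

    import sqlite3

    def open_store(db_path):
        """Create or open a single-file store; one inode for any number of entries."""
        conn = sqlite3.connect(db_path)
        conn.execute("CREATE TABLE IF NOT EXISTS files (path TEXT PRIMARY KEY, body BLOB)")
        return conn

    def put_file(conn, path):
        """Read a small file from disk and store its contents under its path."""
        with open(path, "rb") as handle:
            conn.execute("INSERT OR REPLACE INTO files (path, body) VALUES (?, ?)",
                         (path, handle.read()))
        conn.commit()

    def get_file(conn, path):
        """Return the stored bytes, or None if the path is not in the store."""
        row = conn.execute("SELECT body FROM files WHERE path = ?", (path,)).fetchone()
        return row[0] if row else None

    if __name__ == "__main__":
        conn = open_store("small_files.db")        # placeholder store location
        # put_file(conn, "cache/item-0001.json")   # hypothetical small file
        # data = get_file(conn, "cache/item-0001.json")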
What's the real difference between these two commands? Why is the system call to delete a file called unlink instead of delete?
You need to understand a bit about the original Unix file system to understand this very important question.
Unlike other operating systems of its era (the late 60s and early 70s), Unix did not store the file name together with the actual directory information (i.e., where the file was stored on the disk). Instead, Unix created a separate "inode table" to hold that information and identify the actual file, and then allowed separate text files to act as directories of names and inodes. Originally, directory files were meant to be manipulated like all other files, as straight text files, using the same tools (cat, cut, sed, etc.) that shell programmers are familiar with to this day.
One important consequence of this architectural decision was that a single file could have more than one name! Each occurrence of an inode number in a particular directory file was essentially a link to that inode, and that is exactly what it was called. To connect a file name to the file's inode (the "actual" file), you "linked" it, and when you deleted the name from a directory you "unlinked" it.
Of course, unlinking a file name did not automatically mean that you were deleting / removing the file from the disk, because the file might still be known by other names in other directories. The Inode table also includes a link count to keep track of how many names an inode (a file) was known by; linking a name to a file adds one to the link count, and unlinking it removes one. When the link count drops down to zero, then the file is no longer referred to in any directory, presumed to be "unwanted," and only then can it be deleted.
For this reason the "deletion" of a file by name unlinks it - hence the name of the system call - and there is also the very important ln command to create an additional link to a file (really, to the file's inode) and let it be known by another name.
Other, newer operating systems and their file systems have to emulate / respect this behavior in order to comply with the POSIX standard.
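You can watch the link count move with a few lines of Python, which just wraps the same link/unlink system calls (the file names are only examples):

    import os

    # Create a file and give it a second name via a hard link.
    with open("demo.txt", "w") as handle:        # example file name
        handle.write("hello\n")
    os.link("demo.txt", "also-demo.txt")         # link(2): the inode now has two names

    print(os.stat("demo.txt").st_nlink)          # -> 2: one inode, two directory entries

    os.unlink("demo.txt")                        # unlink(2): removes one name only
    print(os.stat("also-demo.txt").st_nlink)     # -> 1: the data is still reachable

    os.unlink("also-demo.txt")                   # link count drops to 0: the file can be freed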
A couple of questions related to one theme: downloading efficiently with Rsync.
Currently, I move files from an 'upload' folder onto a local server using rsync. Files to be moved are often dumped there, and I regularly run rsync so the files don't build up. I use '--remove-source-files' to remove files that have been transferred.
1) The '--delete' options that remove destination files come in several variants that let you choose when the files are removed. Something similar would be handy for '--remove-source-files', since it seems that, by default, rsync only removes the source files after all files have been transferred, rather than after each file. Other than writing a script to make rsync transfer files one by one (sketched below), is there a better way to do this?
2) On the same theme, if a large (single) file is transferred, it can only be deleted after the whole thing has been successfully moved. It strikes me that I might be able to use 'split' to break the file up into smaller chunks, allowing each chunk to be deleted as the transfer progresses; is there a better way to do this?
Thanks.
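For what it's worth, the "transfer files one by one" script mentioned in 1) can be very short. Here is a rough Python sketch that shells out to rsync once per file, so each source is removed as soon as that particular file has gone across (the paths are placeholders, and note that walking the tree this way flattens sub-directories into the destination):

    import os
    import subprocess

    SOURCE_DIR = "/data/upload"           # placeholder upload folder
    DEST = "user@server:/data/archive/"   # placeholder destination

    def move_files_one_by_one(source_dir, dest):
        """Invoke rsync per file so each source is deleted right after its own transfer."""
        for dirpath, _dirnames, filenames in os.walk(source_dir):
            for name in filenames:
                path = os.path.join(dirpath, name)
                subprocess.run(
                    ["rsync", "-a", "--remove-source-files", path, dest],
                    check=True,   # stop (and keep the source) if a transfer fails
                )

    if __name__ == "__main__":
        move_files_one_by_one(SOURCE_DIR, DEST)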
Suppose that a request is made to ls somefile. How does the file system in UNIX handle this request from an algorithmic perspective? Is it an O(1) lookup, or O(log(N)) in the number of files, say starting from the current directory node, or an O(N) linear search, or some combination depending on certain parameters?
It can be O(n). Classic Unix file systems, based on the old-school BSD Fast File System and the like, store files by inode number, and their names are assigned at the directory level, not at the file level. This allows you to have the same file present in multiple locations at the same time, via hard links. As such, a "directory" in most Unix systems is just a file that lists the filenames and inode numbers of all the files stored "in" that directory.
Searching for a particular filename in a directory just means opening that directory file and parsing through it until you find the filename's entry.
Of course, there are many different file systems available for Unix systems these days, and some have completely different internal semantics for finding files, so there's no one "right" answer.
It's O(n), since the file system has to read the directory off physical media initially, but buffer caches will speed that up significantly, depending on the Virtual File System (VFS) implementation in your flavor of *nix. (Notice how the first time you access a file it's slower than the second time you execute the exact same command?)
To learn more, read IBM's article on the anatomy of the Unix file system.
The typical flow for a program like ls would be:
Opendir on the current path.
Readdir for the current path.
Filter the entries returned by readdir through the filter provided on the command line. So it is typically O(n) (see the sketch below).
This is the generic flow; however, there are many optimizations in place for special and frequent cases, like caching of the inode numbers of recent and frequently used paths.
It also depends on how directory files are organized. In classic Unix file systems, entries are stored in order of creation, which forces a read of every entry and pushes the look-up time to O(n). In NTFS, the equivalent of directory files is sorted by name.
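Roughly the same opendir / readdir / filter flow can be sketched in Python: os.scandir stands in for the opendir/readdir loop and fnmatch plays the role of the command-line filter (the pattern is just an example):

    import fnmatch
    import os

    def list_directory(path=".", pattern="*"):
        """Open the directory, read every entry, keep the matches: one linear O(n) pass."""
        matches = []
        with os.scandir(path) as entries:    # opendir + readdir, more or less
            for entry in entries:            # every entry has to be visited
                if fnmatch.fnmatch(entry.name, pattern):
                    matches.append(entry.name)
        return sorted(matches)               # ls sorts its output by default

    if __name__ == "__main__":
        for name in list_directory(".", "*.txt"):   # example pattern
            print(name)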
I can't answer your question. Maybe if you take a peek into the source code, you can answer it yourself and explain to us how it works:
ls.c
ls.h