Unison preserve directory time

In the Unison manual (Unison is a two-way file-syncing program), when you set the -times=true preference, only the time stamps of files are preserved, not those of directories:
From the Manual:
times When this flag is set to true, file modification times (but not directory modtimes) are propagated.
I wanted to see if there is any way the directory time could also be preserved during a sync?

After looking through the Unison manual, it appears that the answer to this question is no.

This open issue says that it's not possible, and would be non-trivial to implement.
You could run unison first, then rsync -r --times --size-only --existing afterwards to sync just the times on paths which already exist.
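For example, a minimal sketch of that two-step approach (the roots, host name and paths below are placeholders, not taken from the question):
# First pass: let Unison reconcile file contents and file modification times.
unison /local/dir ssh://remotehost//remote/dir -times=true
# Second pass: rsync tweaks the timestamps of paths that already exist on the
# destination, without re-copying data whose size already matches.
rsync -r --times --size-only --existing /local/dir/ remotehost:/remote/dir/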

Related

rsync doesn't copy *only* modifications

I'm using rsync to back up my files. I chose rsync because it (should) use modification times to determine if changes have been made and if files need to be updated.
I started my backup (from my computer system (debian) to a portable external hard drive) with this command:
rsync -avz --update --delete --stats --progress --exclude-from=/home/user/scripts/ExclusionRSync --backup --backup-dir=/media/user/hdd/backups/deleted-files /home/user/ /media/user/hdd/backups/backup_user
It worked well and took a lot of time. I believed the second run would be very quick (since I didn't modify files). Unfortunately, the 2nd, 3rd, 4th, ... runs took as long as the first one. I still see all my files being copied even though these files are already on my portable hard drive.
I don't understand why rsync doesn't copy only the modifications (rsync is known to be efficient and to copy only changes, and I specifically pass the --update option).
A side effect of this problem is that all files are moved to my backup dir (deleted-files) as soon as they are transferred. Indeed, rsync deletes the previous copy before transferring the same file again during each update...
I found the solution reading an answer on ServerFault.SE. The FAT filesystem was messing with timestamps:
FAT doesn't track modification times on files as precisely as, say, ext3 (FAT is only precise to within a 2-second window). This leads to particularly nasty behavior with rsync, as it will sometimes decide that the original file is newer or older than the backup file by enough that it needs to re-copy the data or at least re-check the hashes. All in all, it makes for very poor performance on backups. If you must stick with FAT, look into rsync's --size-only and --modify-window flags as workarounds.
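As a hedged illustration of that workaround, here is the command from the question again with a 2-second modify window added (matching FAT's timestamp resolution); the other flags and paths are unchanged:
rsync -avz --modify-window=2 --update --delete --stats --progress --exclude-from=/home/user/scripts/ExclusionRSync --backup --backup-dir=/media/user/hdd/backups/deleted-files /home/user/ /media/user/hdd/backups/backup_user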

Unix command to remove a file that can't be retrieved at any cost

The rm command removes only the reference and not the actual data from the disk, so the data can be retrieved later. Is there any command that deletes the reference and the data at the same time?
It really depends on what you need.
If you need to reclaim the storage space without waiting for all the processes that hold the file open to close it or die, invoke truncate -s 0 FILENAME to free the data, and then remove the file with a regular rm FILENAME. The two operations will not be atomic, though, and programs that have already opened the file can fail in various ways (including a segmentation fault, if they have mapped portions of the file into memory). However, given that you intend to delete both the file and its contents, there is no general way to prevent programs that depend on the contents from failing.
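Put together, a minimal sketch of that sequence (FILENAME is a placeholder):
# Free the blocks immediately, even while other processes hold the file open.
truncate -s 0 FILENAME
# Then remove the directory entry as usual; the two steps are not atomic.
rm FILENAME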
If your goal is for the data not to be retrievable through forensic analysis after removal, use a command such as shred, which is specifically designed to overwrite the data before removing the file. And pay close attention to the limitations of such tools before trusting them to reliably destroy sensitive data.
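A minimal sketch with GNU shred (FILENAME is a placeholder; the flags shown are standard GNU shred options):
# Overwrite the file three times, add a final pass of zeros to hide the
# shredding, then truncate and unlink it.
shred --iterations=3 --zero --remove FILENAME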
If, as your comment suggests, you are on OSX, you can use srm to do "secure removals".
SRM(1) SRM(1)
NAME
srm - securely remove files or directories
SYNOPSIS
srm [OPTION]... FILE...
DESCRIPTION
srm removes each specified file by overwriting, renaming, and truncating it before unlinking. This prevents other people from undeleting or recovering any information about the file from the command line.
Online manpage is here.
Alternatively, shred is available within the GNU coreutils, which you can easily install on OS X using homebrew, with the command
brew install coreutils
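Note that Homebrew installs the GNU tools with a g prefix by default, so the coreutils version is invoked as gshred (or add Homebrew's gnubin directory to your PATH to get the unprefixed names):
gshred --zero --remove FILENAME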

How to find out which script is running at what particular time in unix?

On my application server, some files are getting deleted from one folder at exactly 1 AM every day. I have checked the crontab.wms file and there is no script that runs at 1 AM.
How can I find out which script is deleting the files?
Exactly 1AM makes cron a prime suspect, but processes can be launched from other places (e.g. init). Also, if the directory can be mounted elsewhere then your server may not be deleting the files. And if malware is causing this, the origin of the process could be intentionally hidden. Some information about where the files are and what the files are could be useful clues.
Repeatedly running ps -aef for several seconds may uncover the culprit. I would run it hundreds of times, without sleeping between runs, starting just before 1 AM. There can be a lot of processes to examine.
You may also repeatedly run this:
/usr/sbin/lsof +d <fullNameOfTheDirectory>
to list processes that have opened the specific directory (or files in the directory). This could give a more concise list, but you have to be lucky to be probing at exactly the time the process is using the directory. You may need to try over many nights and you will want both ps and lsof.
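A hedged sketch of such a polling loop, combining the ps and lsof suggestions above (the directory path and log file are placeholders):
DIR=/path/to/the/folder
# Start this shortly before 1 AM and stop it once the files disappear.
while true; do
    date >> /tmp/watch.log
    ps -aef >> /tmp/watch.log
    /usr/sbin/lsof +d "$DIR" >> /tmp/watch.log 2>&1
    sleep 1
done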
If the files do not belong to root, you can chown them to root before 1 AM. If the deletion still succeeds, then you know the process is running as root.
I assume the deletion is messing you up. You can archive the files before 1AM and restore them when they go missing, assuming the files are fairly static. Or, you can remove write permissions for a few minutes to see if that thwarts the process (you should still see it accessing the directory). These are kludges, but could patch things up until you can really solve it.
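For the write-permission experiment, a minimal sketch (the path is a placeholder; run the first command before 1 AM and the second afterwards):
# Removing write permission on the directory blocks non-root processes from
# unlinking files inside it.
chmod a-w /path/to/the/folder
# ... after 1 AM, check whether the files survived, then restore permissions ...
chmod u+w /path/to/the/folder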

Transfering millions of images -- RSync not good enough

We've got a folder, 130GB in size, with millions of tiny (5-20k) image files, and we need to move it from our old server (EC2) to our new server (Hetzner, Germany).
Our SQL files SCP'd over really quickly -- 20-30mb/s at least -- and the first ~5gb or so of images transferred pretty quickly, too.
Then we went home for the day, and coming back in this morning, our images have slowed to only ~5kb/s in transfer. rsync seems to slow down as it hits the middle of the workload. I've looked into alternatives, like gigasync (which doesn't seem to work), but everyone seems to agree rsync is the best option.
We have so many files that doing ls -al takes over an hour, and all my attempts at using Python to batch up our transfer into smaller parts have eaten all available RAM without completing successfully.
How can I transfer all these files at a reasonable speed, using readily available tools and some light scripting?
I don't know if it will be significantly faster, but maybe a
cd /folder/with/data && tar cvzf - . | ssh target 'cd /target/folder && tar xvzf -'
will do the trick.
If you can, maybe restructure your file arrangement. In similar situations, I group the files project-wise or just 1000 at a time so that a single folder doesn't have too many entries at once.
But I can imagine that the need of rsync (which I otherwise like very much) to keep a list of transferred files is responsible for the slowness. If the rsync process occupies so much RAM that it has to swap, all is lost.
So another option could be to rsync folder by folder.
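A hedged sketch of that per-folder approach, assuming the data has already been regrouped into subdirectories as suggested above (host and paths are placeholders):
cd /folder/with/data
# One rsync invocation per subdirectory keeps each transfer's file list small.
for d in */; do
    rsync -a "$d" "newhost:/target/folder/$d"
done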
It's likely that the performance issue isn't with rsync itself, but a result of having that many files in a single directory. Very few file systems perform well with a single huge folder like that. You might consider refactoring that storage to use a hierarchy of subdirectories.
Since it sounds like you're doing essentially a one-time transfer, though, you could try something along the lines of a tar cf - -C <directory> . | ssh <newhost> tar xf - -C <newdirectory> - that might eliminate some of the extra per-file communication rsync does and the extra round-trip delays, but I don't think that will make a significant improvement...
Also, note that, if ls -al is taking an hour, then by the time you get near the end of the transfer, creating each new file is likely to take a significant amount of time (seconds or even minutes), since it first has to check every entry in the directory to see if it's in fact creating a new file or overwriting an old one.

How does file system on UNIX find files?

Suppose that a request is made to ls somefile. How does the file system in UNIX handle this request from an algorithmic perspective? Is it an O(1) query, or O(log(N)) in the number of files, say, starting from the current directory node, or an O(N) linear search, or some combination depending on parameters?
It can be O(n). Classic Unix file systems, based on the old-school BSD Fast File System and the like, store files by inode number, and names are assigned at the directory level, not at the file level. This allows you to have the same file present in multiple locations at the same time, via hard links. As such, a "directory" in most Unix systems is just a file that lists filenames and inode numbers for all the files stored "in" that directory.
Searching for a particular filename in a directory just means opening that directory file and parsing through it until you find the filename's entry.
Of course, there are many different file systems available for Unix systems these days, and some have completely different internal semantics for finding files, so there's no one "right" answer.
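You can see that name-to-inode mapping directly; as a small illustration (any directory path will do):
# -i prints the inode number each name refers to; -f lists entries in raw
# directory order instead of sorting them.
ls -fi /some/directory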
It's O(n), since the file system has to read it off physical media initially, but buffer caches will improve on that significantly, depending on the Virtual File System (VFS) implementation on your flavor of *nix. (Notice how the first time you access a file it's slower than the second time you execute the exact same command?)
To learn more read IBM's article on the Anatomy of the Unix file system.
Typical flow for a program like ls would be:
Opendir on the current path.
Readdir on the current path.
Filter the entries returned by readdir through the filter provided on the command line, so typically O(n).
This is the generic flow; however, there are many optimizations in place for special and frequent cases (like caching of inode numbers of recent and frequent paths).
It also depends on how directory files are organized. In classic Unix file systems, entries are stored in creation order, which forces a read of every entry and pushes the lookup time to O(n). In NTFS, the equivalent of a directory file is sorted by name.
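If you want to watch this happen, here is a hedged illustration on Linux (assuming GNU strace; on older systems the syscall may be named getdents):
# getdents64 is the system call behind readdir(); each call returns a batch of
# directory entries, so a huge directory means many calls.
strace -e trace=getdents64 ls /some/large/directory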
I can't answer your question. Maybe if you take a peek into the source code, you can answer it yourself and explain to us how it works.
ls.c
ls.h

Resources