Can we copy files with incremental checksum computation? - rsync

I know using rsync -c we can copy files and having them verified with checksum. The issue is that rsync first build a list of files and at the same time it computes the checksum for all files. This would take forever if you have millions of files. Is there a solution where the checksum is simply computed as the application reads the file streams for the copy task?

This should be the default behaviour for rsync since 3.0.0, assuming that you don't have all the files in one directory.
However, there are some options that force rsync to scan all the files up front; look at the man page and search for "incremental" for more details.

Related

Btrfs and rsync

So I have been searching for this up and down, but must be doing something fundamentally wrong. What I want to do:
I have configured my NAS to make snapshots of my home folder, file system is btrfs.
That works as it should and is utilizing hard links.
I want to copy off the whole snapshots directory for backup on an ext4 usb disk, using rsync -aH to preserve the hard links.
But hard links are not preserved after rsync completes - I am down to a minimum example where I rsync a file of 2 different snapshots (verified to have identical Inodes) on the btrfs volume - just to another directory - which also creates 2 distinct files.
Am I missing an rsync option here to make this work? Or is rsync simply incapable to do this? Wrong tool for the job?
The files have the same inode number, but btrfs presents them in different (virtual) file systems. Check the "stat" output and you should see that the device is different. rsync correctly determines that these are not true hard links.
If you think about it, this makes sense because if you edit a file its inode number does not change (usually), but before and after snapshots will show different content.
The correct way to achieve what you want is to do a normal rsync from inside the first snapshot. Then do another rsync from the second snapshot to a fresh destination directory, but give the --link-dest=<first-dest-dir> option. This will create a new snapshot with hardlinks into the old snapshot, wherever the files are identical.
E.g.
rsync -aH /.snapshot1/ dest1/
rsync -aH --link-dest=dest1/ /.snapshot2/ dest2/
rsync -aH --link-dest=dest2/ /.snapshot3/ dest3/
rsync -aH --link-dest=dest3/ /.snapshot4/ dest4/
rsync -aH --link-dest=dest4/ /.snapshot5/ dest5/
rsync -aH --link-dest=dest5/ /.snapshot6/ dest6/
You can think of it as doing a cp --link dest1 dest2 before rsync (which would have a similar effect, as long as you don't use --inplace).

copy with rsync when files are different

I have to copy a big directory to my NAS using rsync, I would like to say to rsync only copy the files when source and destination are different to avoid to copy a files already copied.
Skipping identical files is the whole purpose why people use rsync. This is default behavior of rsync. Most of the time the only option you want to use is -a:
rsync -a -P <source> <dest>
The -P just means show progress and the -a means "archive" and that means "when copying files, try to make copy as identical as possible" (try to keep permissions, ownership, timestamps, etc.) but is also means "Only update files if you have to". It's like saying "make sure <dest> is an up-to-date backup of <source>".
However, by default rsync will already consider two files identical, if they have same file size and same last modification date. Of course, two files may also have same size and same last modification date and not be identical. So when running that command for the very first time and you are not sure which files may need update and which ones don't, try this:
rsync -a -c -P <source> <dest>
-c means don't rely just upon size and date, checksum every file and compare the checksums. Only if checkums are identical, consider files as identical. Note that rsync will not necessary checksum the whole file, big files are broken into smaller chunks and every chunk is checksumed separately as only chunks that have changed are transferred.
So even with checksuming you can save you a lot of time when copying over a network connection. It won't save you any time when copying locally because just copying everything is probably faster than checksuming everything. So a plain copy will always beat a checksuming rsync in speed when both, source and destination, are local drives. In that case use
cp -a -v <source> <dest>
or if your system doesn't know -a, use
cp -pPR -v <source> <dest>
that's identical to -a. Again, the -v is just to see some progress.
And I'd only use -c for the very first sync, after that, relying on file size and last modification date usually works very well for updating and it is a whole lot faster. It will work because if a file has been altered since the last sync, it will have a different last modification date and so by just comparing the dates rysnc will know that the file must be updated at the destination. Of course, that only works if your systems all have the correct date/time set and if you don't manipulate the last modification date of files and also don't forbid your system to update them.
If you want to skip files solely on presence, use this:
rsync -a -P --ignore-existing <source> <dest>
That's like telling rsync "If you see a file with the same name at the destination, always consider it to be identical and never update it".
Please note that if -a detects a file in <source> is different than a files in <dist>, whether this is determined by size and modification date or by checksumming, it will always update the file at <dest> to match then file at <source>. If multiple sources are syncing to the same destination, you might also want to add -u which means "in case two files are different, only update if the file at <source> has a newer last modification date than then file at <dest>"
Just as a general tip, if you type
man <command>
in a terminal, you will get a nice help page on most systems (Linux, MacOS X and UNIX systems), explaining you all the options in all detail. You can scroll up/down using arrow keys or page up/down and you can leave that view by hitting "q" for quit. E.g.
man rsync

synchronise local directories over ssh

The following command works great for me for a single file:
scp your_username#remotehost.edu:foobar.txt /some/local/directory
What I want to do is do it recursive (i.e. for all subdirectories / subfiles of a given path on server), merge folders and overwrite files that already exist locally, and finally downland only those files on server that are smaller than a certain value (e.g. 10 mb).
How could I do that?
Use rsync.
Your command is likely to look like this:
rsync -az --max-size=10m your_username#remotehost.edu:foobar.txt /some/local/directory
-a (archive mode - the sync is recursive, transfers ownership, attributes, symlinks among other things)
-z (compresses transfer)
--max-size (only copies files up to a certain size)
There are many more flags which may be suitable. Checkout the docs for more details - http://linux.die.net/man/1/rsync
First option: use rsync.
Second option, and it's not going to be a one liner, but can be done in three or four lines:
Create a tar archive on the remote system using ssh.
Copy the tar from remote system with scp.
Untar the archive locally.
If the creation of the archive gets a bit complicated and involves using find and/or tar with several options it is quite practical to create a script which would do that locally, upload it on the server with scp, and only then execute remotely with ssh.

Rsync previous half-copied files?

I found rsync behaves differently in the following two situations:
(1) All the files are copied by using rsync, then using rsync again will be fast (skip all the files);
(2) Use cp to copy files, then using rsync will be slow (or may be run freshly?)
So my confusion is "Does rsync generate any internal things on the files so that it can refer to avoid duplicate checking?"
rsync -a (in archive mode, which I presume you ran) retains all attributes of a file, including creation/modification time. cp does not. I suppose something in the file attributes that's different when you use cp, probably a later modification time, in the destination files, made rsync think they are newer files, so it either recopied them or had to check the contents.

Copy or rsync command

The following command is working as expected...
cp -ur /home/abc/* /mnt/windowsabc/
Does rsync has any advantage over it? Is there a better way to keep to backup folder in sync every 24 hours?
Rsync is better since it will only copy only the updated parts of the updated file, instead of the whole file. It also uses compression and encryption if you want. Check out this tutorial.
rsync is not necessarily more efficient, due to the more detailed inventory of files and blocks it performs. The algorithm is fantastic at what it does, but you need to understand your problem to know if it is really going to be the best choice.
On a very large file system (say many thousands or millions of files) where files tend to be added but not updated, "cp -u" will likely be more efficient. cp makes the decision to copy solely on metadata and can simply get to the business of copying.
Note that you might want some buffering, e.g. by using tar rather than straight cp, depending on the size of the files, network performance, other disk activity, etc. I find the following idea very useful:
tar cf - . | tar xCf directory -
Metadata itself may actually become a significant overhead on very large (cluster) file systems, but rsync and cp will share this problem.
rsync seems to frequently be the preferred tool (and in general purpose applications is my usual default choice), but there are probably many people who blindly use rsync without thinking it through.
The command as written will create new directories and files with the current date and time stamp, and yourself as the owner. If you are the only user on your system and you are doing this daily it may not matter much. But if preserving those attributes matters to you, you can modify your command with
cp -pur /home/abc/* /mnt/windowsabc/
The -p will preserve ownership, timestamps, and mode of the file. This can be pretty important depending on what you're backing up.
The alternative command with rsync would be
rsync -avh /home/abc/* /mnt/windowsabc
With rsync, -a indicates "archive" which preserves all those attributes mentioned above. -v indicates "verbose" which just lists what it's doing with each file as it runs. -z is left out here for local copies, but is for compression, which will help if you are backing up over a network. Finally, the -h tells rsync to report sizes in human-readable formats like MB,GB,etc.
Out of curiosity, I ran one copy to prime the system and avoid biasing against the first run, then I timed the following on a test run of 1GB of files from an internal SSD drive to a USB-connected HDD. These simply copied to empty target directories.
cp -pur : 19.5 seconds
rsync -ah : 19.6 seconds
rsync -azh : 61.5 seconds
Both commands seem to be about the same, although zipping and unzipping obviously tax the system where bandwidth is not a bottleneck.
Especially if you use a copy-on-write filesystem like BTRFS or ZFS, rsync is much better.
I use BTRFS, and I have this in my ~/.bashrc:
alias cp="rsync -ah --inplace --no-whole-file --info=progress2"
The important flag here for CoW FSs like BTRFS is --inplace because it only copies the changed part of the files, doesn't create new inodes for small changes between files, etc. See this.
It's not really a question of what's more efficient.
The commands 'rsync', and 'cp' are not equivalent and achieve different goals.
1- rsync can preserve the time of creation of existing files. (using -a option)
2- rsync will run multiprocess and transfer using either local sockets or network sockets. (i.e. fork itself into multiple processes)
3- The multiprocessing, and threading will increase your throughput when copying large number of small files, and even with multiple larger files.
So bottom line is rsync is for large data, and cp is for smaller local copying. (MB to small GB range). When you start getting into multiple GB or in the TB range, go with rsync. And of course network copies, rsync all the way.
For a local copy, the only advantage of rsync is that it will avoid copying if the file already exists in the destination directory. The definition of "already exists" is (a) same file name (b) same size (c) same timestamp. (Maybe same owner/group; I am not sure...)
The "rsync algorithm" is great for incremental updates of a file over a slow network link, but it will not buy you much for a local copy, as it needs to read the existing (partial) file to run it's "diff" computation.
So if you are running this sort of command frequently, and the set of changed files is small relative to the total number of files, you should find that rsync is faster than cp. (Also rsync has a --delete option that you might find useful.)
Keep in mind that while transferring files internally on a machine i.e not network transfer, using the -z flag can have a massive difference in the time taken for the transfer.
Transfer within same machine
Case 1: With -z flag:
TAR took: 9.48345208168
Encryption took: 2.79352903366
CP took = 5.07273387909
Rsync took = 30.5113282204
Case 2: Without the -z flag:
TAR took: 10.7535531521
Encryption took: 3.0386879921
CP took = 4.85565590858
Rsync took = 4.94515299797
if you are using cp doesn't save existing files when copying folders of the same name. Lets say you have this folders:
/myFolder
someTextFile.txt
/someOtherFolder
/myFolder
wellHelloThere.txt
Then you copy one over the other:
cp /someOtherFolder/myFolder /myFolder
result:
/myFolder
wellHelloThere.txt
This is at least what happens on macOS and I wanted to preserve the diff files so I used rsync.
I will prefer to use rsync with the following options
rsync -avhW --no-compress --progress --info=progress2 <src directory> <dst directory>
The above parameters can be defined as follows :
-a for the archive to preserves ownership, permissions, etc.
-v for verbose
-h for human-readable
-W for copying whole files only
--no-compress as there's no lack of bandwidth between local devices
--progress to see the progress of large files
--info=progress2 to see the overall progress
source directory path
destination directory path
rsync is much much better compared to cp because rsync copies whole files/directory only the first time. The next time when you use rsync command with the same files/directory, only new changes are copied to the destination folder, not the entire files are copied.
I used rsynk to transfer 330G data from a local HD to a external HD via USB 3.0. It took me three days. The transfer rate went down to 800 Kb/s and rised to 50 M/s for a while only after pausing the job. It is a typical overbuffering issue. Bad experience for local file tranfers: as the name indicates, (R)sync stands for REMOTE-sync (optimized for tranfers via network). As often happens, I discovered the "-z" flag only after I wondered about the issue and looked for an understandment

Resources