Why is rsync so slow?

I use rsync to back up a 60G folder from my laptop to an external USB drive. Only 4G of data has been added, yet it took a long time to finish: 2 hours.
Here is the command:
rsync -av --exclude=target/ --exclude=".git/" --delete --link-dest=$destdir/backup.1 $element $destdir/backup.0
Do you have an explanation?
What slows rsync down more: a lot of small files, or big binary files (photos)?

As I don't know your system exactly, I am making a few assumptions here. If they don't match your situation, please clarify your question and I'll happily update my answer.
I am assuming you have a lot of files, no matter their size, in the location you are copying from. Because of the design of the rsync protocol, this alone can make a run rather slow.
rsync works like this:
1. Build a file list of the source location.
2. For every file in the source location:
   a. Get its size and mtime (modification timestamp).
   b. Compare them with the size and mtime of the copy in the destination location.
   c. If they differ, copy the file from the source to the destination.
Done.
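If you want to see which files fail that quick check before any data moves, a dry run along these lines will list them without copying anything. This is just the command from your question with -n (dry run) and --itemize-changes added:
rsync -avn --itemize-changes --exclude=target/ --exclude=".git/" --delete --link-dest=$destdir/backup.1 $element $destdir/backup.0
Each line of output names a file rsync would transfer, together with a short code describing why.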
If you only have a few files, this will obviously be faster than for many files. Your USB drive might be the bottleneck, as retrieving the size and timestamp of every file causes a lot of seeks in the inode table.
Maybe a tool like iotop (in case you're on Linux; similar tools are available for almost all platforms) can help you identify the bottleneck.
The --delete option can also slow rsync down, if retrieving the complete file list of the target location is slow (which is likely for an external, rotating USB disk). To verify that this is the problem, on any OS with bash, just type time ls -Ral <target-location> > filelist.txt (redirecting the output to a file, since printing to the screen is much slower). If this takes a lot longer than on your source disk, your target disk could be the bottleneck.
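For example (hypothetical paths), run the same listing against both disks and compare the timings:
time ls -Ral /path/to/source > source-filelist.txt
time ls -Ral /path/to/target > target-filelist.txt
If the second run is dramatically slower, the target disk is the likely culprit.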

Related

Replacing static files that are under heavy read load

Let us assume we have a static file server (Nginx + Linux) that serves 10 files. The files are read almost as frequently as the server can process. However, some of the files need to be replaced with new versions, so that the filename and URL remain unaltered. How can the files be replaced safely, without fearing that some reads will fail or return a mix of two versions?
I understand this is a rather basic operating system matter and has something to do with renames, symlinks, and file sizes. However, I failed to find a clear reference or a good discussion and I hope we can build one here.
Use rsync. Typically I choose rsync -av src dst, but YMMV.
What is terrific about rsync is that, in addition to having essentially zero cost when little or nothing has changed, it uses an atomic rename. During the file transfer, a ".fooNNNNN" temp file grows bigger and bigger. Once the transfer is complete, rsync closes the temp file and renames it on top of "foo". Web clients therefore see either all of the old file or all of the new one. Note that range downloads (say, resuming after an error) are not atomic, exposing such clients to corruption, especially if bytes were inserted near the beginning of the file; the SHA1 wouldn't validate for such a client, and they would have to restart the download from scratch. By the way, if these are "large" files, tell nginx to use zero-copy sendfile().
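If you ever need the same guarantee without rsync, the underlying pattern is simply "write to a temp file on the same filesystem, then rename over the old name", since rename() is atomic on POSIX filesystems. A minimal sketch with hypothetical paths:
# write the new version next to the old one, on the same filesystem
cp /staging/foo /var/www/static/.foo.tmp
# atomically swap it in; readers see either the old or the new file, never a mix
mv /var/www/static/.foo.tmp /var/www/static/foo
(The nginx side of the last suggestion is just sendfile on; in the http or server block.)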

rsync doesn't copy *only* modifications

I'm using rsync to back up my files. I chose rsync because it (should) use modification times to determine whether changes have been made and whether files need to be updated.
I started my backup (from my computer system (debian) to a portable external hard drive) with this command:
rsync -avz --update --delete --stats --progress --exclude-from=/home/user/scripts/ExclusionRSync --backup --backup-dir=/media/user/hdd/backups/deleted-files /home/user/ /media/user/hdd/backups/backup_user
It worked well and took a lot of time. I believed the second run would be very quick (since I hadn't modified any files). Unfortunately, the 2nd, 3rd, 4th, ... runs took as long as the first one. I still see all my files being copied, even though these files are already on my portable hard drive.
I don't understand why rsync doesn't copy only the modifications (rsync is known to be efficient and to copy only changes, and I specifically pass the --update option).
A side effect of this problem is that all files are moved to my backup dir (deleted-files) as soon as they are transferred: rsync deletes the previous copy before re-copying the same file on each run...
I found the solution reading an answer on Serverfault.SE. The FAT filesystem was messing with the timestamps:
FAT doesn't track modification times on files as precisely as, say, ext3 (FAT is only precise to within a 2-second window). This leads to particularly nasty behavior with rsync, as it will sometimes decide that the original file is newer or older than the backup file by enough that it needs to re-copy the data or at least re-check the hashes. All in all, it makes for very poor performance on backups. If you must stick with FAT, look into rsync's --size-only and --modify-window flags as workarounds.

Synchronization over HTTP: rsync versus normal upload

I'm running file synchronization over HTTP. Both sides implement rsync. When synchronizing, I have two choices for uploading:
use a simple POST request if:
the file to be uploaded doesn't exist on the remote side, or
the file exists and is bigger than a certain value M;
else: perform rsync over GET requests.
My question is: how can I determine the ideal value of M?
I'm certain that for a certain file size, performing a simple upload is faster than performing the rsync steps, especially for multiple files.
Thanks
If you're using rsync correctly, I'd bet that it's always faster, especially with multiple files.
Rsync is built specifically to check the differences between directory trees and update the target directory incrementally.
The following is a one-liner to keep in mind whenever you need to sync two directory trees.
rsync -av --delete /path/to/src /path/to/target
(also works over SSH, if necessary.)
Only keep in mind that rsync is picky about trailing slashes on directory paths.
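The trailing-slash behavior is the usual surprise: it decides whether the source directory itself or only its contents land under the target. A quick illustration with hypothetical paths:
# copies the directory itself: files end up in /path/to/target/src/...
rsync -av --delete /path/to/src /path/to/target
# copies only the contents of src: files end up directly in /path/to/target/...
rsync -av --delete /path/to/src/ /path/to/target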

Transferring millions of images -- rsync not good enough

We've got a folder, 130GB in size, with millions of tiny (5-20k) image files, and we need to move it from our old server (EC2) to our new server (Hetzner, Germany).
Our SQL files SCP'd over really quickly -- 20-30 MB/s at least -- and the first ~5 GB or so of images transferred pretty quickly, too.
Then we went home for the day, and coming back in this morning, our images have slowed to only ~5 KB/s in transfer. rsync seems to slow down as it hits the middle of the workload. I've looked into alternatives, like gigasync (which doesn't seem to work), but everyone seems to agree rsync is the best option.
We have so many files that doing ls -al takes over an hour, and all my attempts at using Python to batch the transfer into smaller parts have eaten all available RAM without completing successfully.
How can I transfer all these files at a reasonable speed, using readily available tools and some light scripting?
I don't know if it will be significantly faster, but maybe a
cd /folder/with/data; tar cvzf - . | ssh target 'cd /target/folder; tar xvzf -'
will do the trick.
If you can, maybe restructure your file arrangement. In similar situations, I group the files per project or just in batches of 1000, so that a single folder doesn't have too many entries at once.
But I can imagine that rsync (which I otherwise like very much) has to keep a list of the transferred files in memory, and that this is responsible for the slowness. If the rsync process occupies so much RAM that it has to swap, all is lost.
So another option could be to rsync folder by folder.
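A minimal sketch of that approach (hostname and directory names are hypothetical); each rsync run then only has to build and hold the file list for one subdirectory:
# sync one subdirectory at a time to keep each file list small
for d in /folder/with/data/*/ ; do
    rsync -a "$d" "newserver:/target/folder/$(basename "$d")/"
done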
It's likely that the performance issue isn't with rsync itself, but a result of having that many files in a single directory. Very few file systems perform well with a single huge folder like that. You might consider refactoring that storage to use a hierarchy of subdirectories.
Since it sounds like you're doing essentially a one-time transfer, though, you could try something along the lines of tar cf - -C <directory> . | ssh <newhost> tar xf - -C <newdirectory>. That might eliminate some of the extra per-file communication rsync does and the extra round-trip delays, but I don't think it will make a significant improvement...
Also, note that if ls -al is taking an hour, then by the time you get near the end of the transfer, creating each new file is likely to take a significant amount of time (seconds or even minutes), since it first has to check every entry in the directory to see if it's in fact creating a new file or overwriting an old one.
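If you do end up refactoring the layout as suggested above, a common pattern is to fan files out into subdirectories keyed on a prefix of the filename, so no single directory grows huge. A rough sketch with hypothetical paths:
# fan files out by the first two characters of their names
cd /folder/with/data
find . -maxdepth 1 -type f -print0 | while IFS= read -r -d '' f; do
    name=$(basename "$f")
    prefix=${name:0:2}
    mkdir -p "by-prefix/$prefix"
    mv -- "$f" "by-prefix/$prefix/"
done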

How to test the compression rate of rsync with two big local files?

I want to test whether rsync will work to sync some huge DVD images containing installers, in order to see what speedup I can obtain from using rsync, if any.
I would like to run the test locally. How can I convince rsync to just evaluate how much data would be required to sync the two files?
PS. I am fully aware that I should try to sync small and uncompressed files, but this is outside the question in this case.
Just use
rsync -avz --log-file="/Users/username/rsync.log" /home/test /home/testlocation
then check the log file for the size and speed.
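If you would rather see the numbers directly, the --stats summary that rsync prints at the end is another place to look; something along these lines (paths as above) reports how much literal data was actually sent versus how much was matched by the delta algorithm:
rsync -avz --stats /home/test /home/testlocation | grep -E "Literal data|Matched data|Total transferred file size"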
