rsync doesn't copy *only* modifications - rsync

I'm using rsync to back up my files. I chose rsync because it (should) use modification times to determine whether files have changed and need to be updated.
I started my backup (from my computer system (debian) to a portable external hard drive) with this command:
rsync -avz --update --delete --stats --progress --exclude-from=/home/user/scripts/ExclusionRSync --backup --backup-dir=/media/user/hdd/backups/deleted-files /home/user/ /media/user/hdd/backups/backup_user
It worked well and took a lot of time. I believed the second time would be very quick (since I didn't modify files). Unfortunately, the 2nd, 3rd, 4th, ... runs took as long as the first one. I still see all my files being copied even though they are already on my portable hard drive.
I don't understand why rsync doesn't copy only modifications (rsync is known to be efficient and to copy only changes, and I specifically pass the --update option).
A side effect of this problem is that all files are moved to my backup dir (deleted-files) as soon as they are transferred. Indeed, rsync deletes the previous file before copying the same file again during each update...

I found the solution while reading an answer on Serverfault.SE. The FAT filesystem was messing with timestamps:
FAT doesn't track modification times on files as precisely as, say
ext3 (FAT is only precise to within a 2 second window). This leads to
particularly nasty behavior with rsync as it will sometimes decide
that the original file is newer or older than the backup file by
enough that it needs to re-copy the data or at least re-check the
hashes. All in all, it makes for very poor performance on backups. If
you must stick with FAT, look into rsync's --size-only and
--modify-window flags as workarounds.
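To illustrate why a 2-second timestamp granularity defeats an exact mtime comparison, here is a minimal Python sketch. It is not rsync's actual code: `round_to_fat` and `looks_unchanged` are hypothetical helpers, and the assumption that the coarse filesystem rounds down is made only for illustration. The one rsync behavior it relies on is that --modify-window=N treats two timestamps as equal when they differ by no more than N seconds.

```python
def round_to_fat(mtime):
    """Simulate a filesystem that stores mtimes in 2-second steps.

    Assumption for illustration only: the value is rounded down.
    """
    return int(mtime) // 2 * 2


def looks_unchanged(src_size, src_mtime, dst_size, dst_mtime, modify_window=0):
    """Quick-check style test: same size, and mtimes within the window."""
    return src_size == dst_size and abs(src_mtime - dst_mtime) <= modify_window


src_mtime = 1_700_000_001            # odd second, as an ext3/ext4 source could report
dst_mtime = round_to_fat(src_mtime)  # what the coarse-grained copy ends up storing

# With an exact comparison the file looks modified on every run.
print(looks_unchanged(4096, src_mtime, 4096, dst_mtime))                   # False
# With a 1-second tolerance it is recognised as up to date.
print(looks_unchanged(4096, src_mtime, 4096, dst_mtime, modify_window=1))  # True
```

On the command line, that tolerance corresponds to adding --modify-window=1 to the rsync invocation from the question.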

Related

Rsync - How to display only changed files

When my colleague and I upload a PHP web project to production, we use rsync for the file transfer with these arguments:
rsync -rltz --progress --stats --delete --perms --chmod=u=rwX,g=rwX,o=rX
When this runs, we see a long list of files that were changed.
Running this twice in a row will always show only the files that were changed between the two transfers.
However, when my colleague runs the same command after I did, he sees a very long list of all files as changed (though the contents are identical), and the run is extremely fast.
If he uploads again, the output is minimal again.
So it seems to me that we get the correct output, showing only changes, but if someone else uploads from another computer, rsync will regard everything as changed.
I believe this may have something to do with the file permissions or times, but would like to know how to best solve this.
The idea is that we only see the changes, regardless who does the upload and in what order.
The huge file list is quite scary to see in a huge project, so we have no idea what was actually changed.
PS: We both deploy using the same user@server as target.
The t in your command says to copy the timestamps of the files, so if they don't match you'll see them get updated. If you think the timestamps on your two machines should match then the problem is something else.
The easiest way to ensure that the timestamps match would be to rsync them down from the server before making your edits.
Incidentally, having two people use rsync to update a production server seems error prone and fragile. You should consider putting your files in Git and pushing them to the server that way (you'd need a server-side hook to update the working copy for the web server to use).

Why is rsync so slow?

I use rsync to back up a 60 GB folder from my laptop to an external USB drive. Only 4 GB of data had been added, yet it took a long time to finish: 2 hours.
Here is the command:
rsync -av --exclude=target/ --exclude=".git/" --delete --link-dest=$destdir/backup.1 $element $destdir/backup.0
Do you have an explanation?
What slows down rsync more: a lot of small files or big binary files (photos)?
As I don't exactly know your system, I am making a few assumptions here. If they don't match your situation, please clarify your question and I'll happily update my answer.
I am assuming you have a lot of files, regardless of their sizes, in the location you are copying from. This makes rsync rather slow, which is a consequence of the design of the rsync protocol.
rsync works like this:
1. Build a file-list of the source location.
2. For all files in the source location:
a. Get the size and the mtime (modification timestamp)
b. Compare it with the size and mtime of the copy in the destination location
c. If they differ, copy the file from the source to the destination
Done.
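The steps above can be sketched in Python as follows. This is a simplified illustration of the size-and-mtime check described in this answer, not rsync's real implementation; `files_to_copy` is a hypothetical name, and comparing mtimes at whole-second resolution is an assumption of the sketch.

```python
import os


def files_to_copy(src_root, dst_root):
    """Return relative paths whose size or mtime differs from the destination copy."""
    pending = []
    # Step 1: build the file list of the source location.
    for dirpath, _dirnames, filenames in os.walk(src_root):
        for name in filenames:
            src_path = os.path.join(dirpath, name)
            rel_path = os.path.relpath(src_path, src_root)
            dst_path = os.path.join(dst_root, rel_path)
            # Step 2a: size and mtime of the source file.
            src_stat = os.stat(src_path)
            try:
                # Step 2b: size and mtime of the copy in the destination.
                dst_stat = os.stat(dst_path)
            except FileNotFoundError:
                pending.append(rel_path)  # not present at the destination yet
                continue
            # Step 2c: any difference means the file would be copied.
            if (src_stat.st_size != dst_stat.st_size
                    or int(src_stat.st_mtime) != int(dst_stat.st_mtime)):
                pending.append(rel_path)
    return sorted(pending)
```

Even when nothing needs copying, the loop still performs one stat call per file on each side, which is why the number of files matters more than their total size.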
If you just have a few files, this will obviously be faster than for many files. Your USB drive might be your bottleneck, as retrieving the size and timestamp of each file causes a lot of jumps in the inode table.
A tool like iotop (if you're on Linux; similar tools are available for almost all platforms) may help you identify the bottleneck.
The --delete option can also slow rsync down if retrieving the complete file list of the target location is slow (which is probable for an external, rotating USB disk). To verify that this is the problem, on any OS with bash, just run time ls -Ral <target-location> > filelist.txt (redirecting the output to a file, since printing to the screen is much slower). If this takes a lot longer than for your source disk, your target disk could be the bottleneck.
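A rough Python counterpart of that timing test, for systems where running ls this way is inconvenient, could look like the sketch below. `time_tree_scan` is a hypothetical helper; it only measures how long it takes to list and stat every entry under a directory, which approximates the metadata work discussed above.

```python
import os
import time


def time_tree_scan(root):
    """Walk `root`, stat every entry, and return (entry_count, elapsed_seconds)."""
    start = time.perf_counter()
    count = 0
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            try:
                # lstat does not follow symlinks, similar in spirit to `ls -l`.
                os.lstat(os.path.join(dirpath, name))
            except OSError:
                continue  # entry vanished or is unreadable; skip it
            count += 1
    return count, time.perf_counter() - start
```

Calling it once for the source directory and once for the target directory gives two timings to compare. The operating system caches metadata, so a second run over the same tree is usually faster than the first.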

synchronization over http: rsync versus normal upload

I'm running file synchronization over HTTP. Both sides implement rsync. When synchronizing, for uploading I have two choices:
use a simple POST request if:
the file to be uploaded doesn't exist on the remote side.
the file exists and is bigger than a certain value M.
else: perform rsync over GET requests.
My question is: how can I determine the best value of M?
I'm certain that for certain file sizes, performing a simple upload is faster than performing the rsync steps, especially for multiple files.
Thanks
If you're using rsync correctly, I'd bet that it's always faster, especially with multiple files.
Rsync is specifically built to check differences between directory trees and update the target directory incrementally.
The following is a one-liner to keep in mind whenever you need to sync two directory trees.
rsync -av --delete /path/to/src /path/to/target
(also works over SSH, if necessary.)
Only keep in mind that rsync is picky about trailing slashes on directory paths.

Transferring millions of images -- RSync not good enough

We've got a folder, 130GB in size, with millions of tiny (5-20k) image files, and we need to move it from our old server (EC2) to our new server (Hetzner, Germany).
Our SQL files SCP'd over really quickly -- 20-30 MB/s at least -- and the first ~5 GB or so of images transferred pretty quickly, too.
Then we went home for the day, and coming back in this morning, our image transfer has slowed to only ~5 KB/s. RSync seems to slow down as it hits the middle of the workload. I've looked into alternatives, like gigasync (which doesn't seem to work), but everyone seems to agree rsync is the best option.
We have so many files that doing ls -al takes over an hour, and all my attempts at using Python to batch up our transfer into smaller parts have eaten all available RAM without successfully completing.
How can I transfer all these files at a reasonable speed, using readily available tools and some light scripting?
I don't know if it will be significantly faster, but maybe a
cd /folder/with/data && tar cvzf - . | ssh target 'cd /target/folder && tar xvzf -'
will do the trick.
If you can, maybe restructure your file arrangement. In similar situations, I group the files project-wise or just 1000-wise together so that a single folder doesn't have too many entries at once.
But I can imagine that the necessity of rsync (which I otherwise like very well, too) to keep a list of transferred files is responsible for the slowness. If the rsync process occupies so much RAM that it has to swap, all is lost.
So another option could be to rsync folder by folder.
It's likely that the performance issue isn't with rsync itself, but a result of having that many files in a single directory. Very few file systems perform well with a single huge folder like that. You might consider refactoring that storage to use a hierarchy of subdirectories.
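One common way to build such a hierarchy is to derive the subdirectory from a hash of the file name. The Python sketch below shows the idea; `sharded_path`, the choice of MD5, and the two-level, two-character layout are all assumptions made for illustration, not something the original posters described.

```python
import hashlib
import os


def sharded_path(root, filename, levels=2, width=2):
    """Map a file name to root/ab/cd/filename using a hash of the name."""
    # MD5 is used only to spread names evenly, not for security.
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    # Take `levels` consecutive slices of `width` hex characters each.
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return os.path.join(root, *parts, filename)
```

With the defaults this spreads files over 256 * 256 = 65,536 leaf directories, so even several million images leave each directory with a modest number of entries, and the location of any file can be recomputed from its name alone.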
Since it sounds like you're doing essentially a one-time transfer, though, you could try something along the lines of a tar cf - -C <directory> . | ssh <newhost> tar xf - -C <newdirectory> - that might eliminate some of the extra per-file communication rsync does and the extra round-trip delays, but I don't think that will make a significant improvement...
Also, note that, if ls -al is taking an hour, then by the time you get near the end of the transfer, creating each new file is likely to take a significant amount of time (seconds or even minutes), since it first has to check every entry in the directory to see if it's in fact creating a new file or overwriting an old one.

How to test compress rate of rsync with two big local files?

I want to test whether rsync will work for syncing some huge DVD images containing installers, in order to see what speedup, if any, I can obtain from using rsync.
I would like to run the test locally, how can I convince rsync to just evaluate how much data would be required in order to sync the two files?
PS. I am fully aware that I should try to sync small and uncompressed files, but this is outside the question in this case.
Just use
rsync -avz --log-file="/Users/username/rsync.log" /home/test /home/testlocation
then check the log file for size and speed
