I want to test whether rsync will work to sync some huge DVD images containing installers, in order to see what speedup, if any, I can obtain from using rsync.
I would like to run the test locally. How can I convince rsync to just evaluate how much data would be required to sync the two files?
PS. I am fully aware that I should try to sync small and uncompressed files, but this is outside the question in this case.
Just use
rsync -avz --log-file="/Users/username/rsync.log" /home/test /home/testlocation
then check the log file for size and speed.
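If you mainly want to see how much data a sync would actually have to move, the --stats summary is useful. A minimal sketch, reusing the placeholder paths from above: --no-whole-file is needed because rsync skips its delta algorithm for purely local copies by default, so without it the whole file is simply rewritten.
rsync -av --no-whole-file --stats /home/test/ /home/testlocation/
In the summary, "Literal data" is roughly what would have to be sent over a real link, while "Matched data" is what the delta algorithm could reuse from the existing copy.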
I use rsync to back up a 60 GB folder between my laptop and an external USB drive. Only 4 GB of data had been added, yet it took a long time to finish: 2 hours.
Here is the command:
rsync -av --exclude=target/ --exclude=".git/" --delete --link-dest=$destdir/backup.1 $element $destdir/backup.0
Do you have an explanation?
What slows rsync down more: a lot of small files, or big binary files (photos)?
As I don't exactly know your system, I am making a few assumptions here. If they don't match your situation, please clarify your question and I'll happily update my answer.
I am assuming you have a lot of files in the location you are copying from, no matter their sizes. This will make rsync rather slow, owing to the design of the rsync protocol.
rsync works like this:
1. Build a file-list of the source location.
2. For all files in the source location:
a. Get the size and the mtime (modification timestamp)
b. Compare it with the size and mtime of the copy in the destination location
c. If they differ, copy the file from the source to the destination
Done.
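As a rough illustration of the quick check in steps 2a/2b, this is essentially what rsync compares per file; the sketch assumes GNU stat and made-up paths:
src=/home/user/somefile
dst=/media/usb/backup/somefile
if [ "$(stat -c '%s %Y' "$src")" = "$(stat -c '%s %Y' "$dst")" ]; then
    echo "size and mtime match - rsync skips the file"
else
    echo "size or mtime differ - rsync copies (or delta-syncs) it"
fi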
If you just have a few files, this will obviously be faster than with many files. Your USB drive might be the bottleneck, as retrieving the size and timestamp of every file causes a lot of jumps around the inode table.
Maybe a tool like iotop (in case you're on Linux; similar tools are available for almost all platforms) can help you identify the bottleneck.
The --delete option can also cause a slow rsync if retrieving the complete file list of the target location is slow (which is probable for an external, rotating USB disk). To verify whether this is the problem, on any OS with bash just type time ls -Ral <target-location> > filelist.txt (diverting the output to a file, since printing to the screen is much slower). If this takes a lot longer than on your source disk, your target disk could be the bottleneck.
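Concretely, timing the same listing on both sides (paths are placeholders) shows which disk struggles just to enumerate its files:
time ls -Ral /path/to/source > source-filelist.txt
time ls -Ral /path/to/target > target-filelist.txt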
I'm using rsync to back up my files. I chose rsync because it (should) use modification times to determine if changes have been made and if files need to be updated.
I started my backup (from my computer system (debian) to a portable external hard drive) with this command:
rsync -avz --update --delete --stats --progress --exclude-from=/home/user/scripts/ExclusionRSync --backup --backup-dir=/media/user/hdd/backups/deleted-files /home/user/ /media/user/hdd/backups/backup_user
It worked well and took a lot of time. I believed the second run would be very quick (since I hadn't modified any files). Unfortunately, the 2nd, 3rd, 4th, ... runs took as long as the first one. I still see all my files being copied even though they are already on my portable hard drive.
I don't understand why rsync doesn't copy only the modifications (rsync is known to be efficient and to copy only changes, and I specifically pass the --update option).
A side effect of this problem is that all files are moved to my backup dir (deleted-files) as soon as they are transferred. Indeed, rsync deletes the previous file before copying the same file again during each update...
I found the solution reading an answer on ServerFault.SE. The FAT filesystem was messing with the timestamps:
FAT doesn't track modification times on files as precisely as, say, ext3 (FAT is only precise to within a 2-second window). This leads to particularly nasty behavior with rsync, as it will sometimes decide that the original file is newer or older than the backup file by enough that it needs to re-copy the data or at least re-check the hashes. All in all, it makes for very poor performance on backups. If you must stick with FAT, look into rsync's --size-only and --modify-window flags as workarounds.
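Applied to the command from the question, the workaround would look roughly like this; --modify-window=1 makes rsync treat mtimes that differ by at most one second as equal, which absorbs FAT's 2-second timestamp resolution:
rsync -avz --update --delete --stats --progress --modify-window=1 --exclude-from=/home/user/scripts/ExclusionRSync --backup --backup-dir=/media/user/hdd/backups/deleted-files /home/user/ /media/user/hdd/backups/backup_user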
I'm running file synchronization over HTTP. Both sides implement rsync. When synchronizing, for uploading I have two choices:
use a simple POST request if:
- the file to be uploaded doesn't exist on the remote side, or
- the file exists and is bigger than a certain value M;
otherwise: perform rsync over GET requests.
My question is: how can I determine the best value of M?
I'm certain that, for certain file sizes, performing a simple upload is faster than performing the rsync steps, especially for multiple files.
Thanks
If you're using rsync correctly, I'd bet that it's always faster, especially with multiple files.
Rsync is specially built to check differences between directory trees and update the target directory incrementally.
The following is a one-liner to keep in mind whenever you need to sync two directory trees.
rsync -av --delete /path/to/src /path/to/target
(also works over SSH, if necessary.)
Only keep in mind that rsync is picky about trailing slashes on directory paths.
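The trailing-slash rule in a nutshell (standard rsync behaviour; paths are placeholders):
rsync -av --delete /path/to/src  /path/to/target   # target ends up containing target/src/...
rsync -av --delete /path/to/src/ /path/to/target   # the contents of src land directly in target/
With --delete involved, mixing the two up can remove more than intended, so it is worth double-checking which form you want.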
We've got a folder, 130 GB in size, with millions of tiny (5-20 KB) image files, and we need to move it from our old server (EC2) to our new server (Hetzner, Germany).
Our SQL files SCP'd over really quickly -- at least 20-30 MB/s -- and the first ~5 GB or so of images transferred pretty quickly, too.
Then we went home for the day, and when we came in this morning, our images had slowed to only ~5 KB/s in transfer. rsync seems to slow down as it hits the middle of the workload. I've looked into alternatives, like gigasync (which doesn't seem to work), but everyone seems to agree rsync is the best option.
We have so many files that doing ls -al takes over an hour, and all my attempts at using Python to batch the transfer into smaller parts have eaten all available RAM without successfully completing.
How can I transfer all these files at a reasonable speed, using readily available tools and some light scripting?
I don't know if it will be significantly faster, but maybe a
cd /folder/with/data; tar cvzf - . | ssh target 'cd /target/folder; tar xvzf -'
will do the trick.
If you can, maybe restructure your file arrangement. In similar situations, I group the files project-wise or just 1000 at a time, so that a single folder doesn't have too many entries at once.
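A very rough sketch of such a restructuring (hypothetical paths, bash only, and worth trying on a copy first, since globbing millions of names is itself slow): files get bucketed by the first two characters of their names.
cd /folder/with/images
for f in *; do
    [ -f "$f" ] || continue
    mkdir -p "${f:0:2}"   # bucket named after the first two characters
    mv -- "$f" "${f:0:2}/"
done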
But I can imagine that the need of rsync (which I otherwise like very much) to keep a list of the transferred files is responsible for the slowness. If the rsync process occupies so much RAM that it has to swap, all is lost.
So another option could be to rsync folder by folder.
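Something along these lines (host and paths are placeholders) keeps each rsync run's file list small by handling one top-level directory at a time:
cd /folder/with/data
for d in */; do
    rsync -a "$d" "newserver:/target/folder/$d"
done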
It's likely that the performance issue isn't with rsync itself, but a result of having that many files in a single directory. Very few file systems perform well with a single huge folder like that. You might consider refactoring that storage to use a hierarchy of subdirectories.
Since it sounds like you're doing essentially a one-time transfer, though, you could try something along the lines of a tar cf - -C <directory> . | ssh <newhost> tar xf - -C <newdirectory> - that might eliminate some of the extra per-file communication rsync does and the extra round-trip delays, but I don't think that will make a significant improvement...
Also, note that, if ls -al is taking an hour, then by the time you get near the end of the transfer, creating each new file is likely to take a significant amount of time (seconds or even minutes), since it first has to check every entry in the directory to see if it's in fact creating a new file or overwriting an old one.
I have a great deal of data to keep synchronized across 4 or 5 sites around the world, around half a terabyte at each site. This changes (either additions or modifications) by around 1.4 GB per day, and the data can change at any of the four sites.
A large percentage (30%) of the data consists of duplicate packages (perhaps packaged-up JDKs), so the solution would have to include a way of detecting that such things are lying around on the local machine and grabbing them from there instead of downloading them from another site.
Version control is not an issue; this is not a codebase per se.
I'm just interested if there are any solutions out there (preferably open-source) that get close to such a thing?
My baby script using rsync doesn't cut the mustard any more, I'd like to do more complex, intelligent synchronization.
Thanks
Edit: This should be UNIX-based :)
Have you tried Unison?
I've had good results with it. It's basically a smarter rsync, which may be what you want. There is a listing comparing file-syncing tools here.
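For reference, Unison synchronizes one pair of roots per run, so several sites would be handled pairwise (for example in a star around one hub). A minimal invocation looks roughly like this; the host and paths are placeholders:
unison /data ssh://siteB.example.com//data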
Sounds like a job for BitTorrent.
For each new file at each site, create a BitTorrent seed file and put it into a centralized, web-accessible dir.
Each site then downloads (via BitTorrent) all files. This gets you bandwidth sharing and automatic reuse of local copies.
The actual recipe will depend on your needs.
For example, you can create one BitTorrent seed for each file on each host, and set the modification time of the seed file to be the same as the modification time of the file itself. Since you'll be doing it daily (hourly?), it's better to use something like "make" to (re-)create seed files only for new or updated files.
Then you copy all seed files from all hosts to the centralized location ("tracker dir") with the option "overwrite only if newer". This gets you a set of torrent seeds for the newest copies of all files.
Then each host downloads all seed files (again, with the "overwrite if newer" setting) and starts a BitTorrent download on all of them. This will download/re-download all the new/updated files.
Rinse and repeat, daily.
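A very rough sketch of that daily cycle, using the Transmission command-line tools as one possible client; every path, host name, and the stamp-file trick are assumptions rather than part of the recipe above:
SEEDDIR=/srv/seeds                     # local .torrent files
TRACKER=user@central:/srv/all-seeds    # the centralized "tracker dir"
[ -f last-run.stamp ] || touch -d '1970-01-01' last-run.stamp

# 1. (re-)create a seed for each new or updated file, copying the file's mtime
#    (transmission-create would normally also get a tracker URL via -t)
find /data -type f -newer last-run.stamp | while read -r f; do
    t="$SEEDDIR/$(basename "$f").torrent"
    transmission-create -o "$t" "$f"
    touch -r "$f" "$t"
done
touch last-run.stamp

# 2. push local seeds to the central dir and pull everyone else's, newest wins
rsync -au "$SEEDDIR/" "$TRACKER/"
rsync -au "$TRACKER/" "$SEEDDIR/"

# 3. hand every seed file to a running bittorrent client (transmission-daemon here)
for t in "$SEEDDIR"/*.torrent; do
    transmission-remote -w /data -a "$t"
done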
BTW, there will be no "downloading from itself", as you said in the comment. If a file is already present on the local host, its checksum will be verified and no downloading will take place.
How about something along the lines of Red Hat's Global File System (GFS), so that the whole structure is split across every site onto multiple devices, rather than having it all replicated at each location?
Or perhaps a commercial network storage system such as from LeftHand Networks (disclaimer - I have no idea on cost, and haven't used them).
You have a lot of options:
- You can try to set up a replicated DB to store the data.
- Use a combination of rsync or lftp and custom scripts, but that doesn't suit you.
- Use git repos with maximum compression and sync between them using some scripts.
- Since the amount of data is rather large, and probably important, do either some custom development or hire an expert ;)
Check out Super Flexible... it's pretty cool. I haven't used it in a large-scale environment, but on a 3-node system it seemed to work perfectly.
Sounds like a job for Foldershare
Have you tried the detect-renamed patch for rsync (http://samba.anu.edu.au/ftp/rsync/dev/patches/detect-renamed.diff)? I haven't tried it myself, but I wonder whether it will detect not just renamed but also duplicated files. If it won't detect duplicated files, then, I guess, it might be possible to modify the patch to do so.
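For anyone who wants to experiment with it, applying the patch would look roughly like this; the version number, the location of the patch relative to the source tree, and the exact name of the option the patch adds are assumptions, so check the patch header after downloading:
cd rsync-3.1.3
patch -p1 < detect-renamed.diff
./configure && make
./rsync -av --detect-renamed /src/ host:/dst/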