Issues using rsync to migrate files to new server

I am trying to copy a directory full of directories and small files to a new server for an app migration. rsync is always my go-to tool for this kind of migration, but this time it is not working as expected.
The directory has 174,412 files and is 136G in size. Based on this I created a 256G disk for it on the new server.
The issue is that when I rsync'd the files over to the new server, the new partition ran out of space before all the files were copied.
I ran some tests with a bigger destination disk on my test machine, and when the copy finishes the total size on the new disk is 272G:
time sudo rsync -avh /mnt/dotcms/* /data2/
sent 291.61G bytes received 2.85M bytes 51.75M bytes/sec
total size is 291.52G speedup is 1.00
df -h /data2
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/data2vg-data2lv 425G 272G 154G 64% /data2
The source is on a NAS and the new target is an XFS file system, so at first I thought it might be a block-size issue. But then I used the cp command and it copied the exact same size:
time sudo cp -av /mnt/dotcms/* /data2
df -h /data2
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/data2vg-data2lv 425G 136G 290G 32% /data2
Why is rsync increasing the space used?

According to the documentation, dotcms makes use of hard links, so you need to give rsync the -H option to preserve them. Note that GNU cp -av preserves hard links, so it doesn't have this problem.
Other rsync options you should consider using include:
-H, --hard-links : preserve hard links
-A, --acls : preserve ACLs (implies --perms)
-X, --xattrs : preserve extended attributes
-S, --sparse : turn sequences of nulls into sparse blocks
--delete : delete extraneous files from destination dirs
This assumes you are running as root and that the destination is supposed to have the same users/groups as the source. If the users and groups are not the same, then @Cyrus's alternative command line using --numeric-ids may be more appropriate.

Related

Fastest way to move 90 million files (270GB) between two NFS 1Gb/s folders

I need to move 90 million files from one NFS folder to a second NFS folder. Both connections to the NFS servers go through the same eth0, which is 1Gb/s. Sync is not needed, only move (overwrite if it exists). I think my main problem is the number of files, not the total size, so the best approach should be the one with the fewest system calls per file against the NFS folders.
I tried cp, rsync, and finally parsync (http://moo.nac.uci.edu/~hjm/parsync/). parsync first took 10 hours to generate a 12 GB gzip of the file list; after another 40 hours not a single file had been copied. It was running 10 threads until I canceled it and started debugging. With the -vvv option I found it was making another call (stat?) to each file from the list (it uses rsync internally):
[sender] make_file(accounts/hostingfacil/snap.2017-01-07.041721/hostingfacil/homedir/public_html/members/vendor/composer/62ebc48e/vendor/whmcs/whmcs-foundation/lib/Domains/DomainLookup/Provider.php,*,0)*
the parsync command is:
time parsync --rsyncopts="-v -v -v" --reusecache --NP=10 --startdir=/nfsbackup/folder1/subfolder2 thefolder /nfsbackup2/folder1/subfolder2
Each rsync has this form:
rsync --bwlimit=1000000 -v -v -v -a --files-from=/root/.parsync/kds-chunk-9 /nfsbackup/folder1/subfolder2 /nfsbackup2/folder1/subfolder2
The NFS folders are mounted:
server:/export/folder/folder /nfsbackup2 nfs auto,noexec,noatime,nolock,bg,intr,tcp,actimeo=1800,nfsvers=3,vers=3 0 0
Any idea how to instruct the rsync to copy the files already in the list from the nfs to the nfs2 folder? Or any way to make this copy efficiently (one system call per file?)
I've had issues doing the same thing, and I found that it's best to just run a find command and copy each file individually:
cd /origin/path
find . | cpio -updm ../destination/
The -u option overwrites existing files.

Why is rsync daemon truncating this path?

I'm trying to synchronize a set of remote files via an rsync daemon, but the resulting path is missing the initial path element.
$ rsync -HRavP ftp.ncbi.nih.gov::refseq/H_sapiens/README 2015-05-11/
receiving incremental file list
created directory 2015-05-11
H_sapiens/
H_sapiens/README
4,850 100% 4.63MB/s 0:00:00 (xfr#1, to-chk=0/2)
sent 51 bytes received 5,639 bytes 3,793.33 bytes/sec
total size is 4,850 speedup is 0.85
$ tree 2015-05-11/
2015-05-11/
└── H_sapiens
└── README
Notice that the resulting tree is missing the first part of the remote path ("refseq").
I realize that I can append the first element of the remote path to the destination path, but it seems unlikely (to me) that this is the intended behavior of rsync.
It's worth noting for comparison that rsync -HRavP refseq/H_sapiens/README 2015-05-11/ (where the source is a local file) correctly creates the full relative path under the destination directory.
See the rsync documentation:
CONNECTING TO AN RSYNC SERVER
...
Using rsync in this way is the same as using it with rsh or ssh except that:
You use a double colon :: instead of a single colon to separate the hostname from the path.
The first word of the "path" is actually a module name.
You can list all module names with:
rsync ftp.ncbi.nih.gov::
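So in this case "refseq" is the module name, not a directory, which is why -R does not reproduce it. A sketch of the workaround (same command as the question, with the module name appended to the destination by hand):

```shell
# "refseq" is the daemon module, so rsync strips it from the copied
# path; recreate it on the destination side yourself
mkdir -p 2015-05-11/refseq
rsync -HRavP ftp.ncbi.nih.gov::refseq/H_sapiens/README 2015-05-11/refseq/
```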

Rsync takes longer time “receiving incremental file list”

I am using rsync to copy files from a remote host to the local machine in a cron job. On each run I only need rsync to copy new files from the remote host, but it gets stuck at the line "receiving incremental file list" for a very long time. Below is the command I am using. Is there any other way to speed up this rsync process?
rsync -avz --inplace --progress --delete -ahe ssh remoteuser@remotehost:/home/bin/dir1/data /home/bin/dir1
Have you tried with --delete-before, --delete-after or --delay-updates?
Some options require rsync to know the full file list, so these options
disable the incremental recursion mode. These include: --delete-before, --delete-after, --prune-empty-dirs, and --delay-updates. Because of this, the default delete mode when you specify --delete is now --delete-during when both ends of the connection are at least 3.0.0 (use --del or --delete-during to request this improved deletion mode explicitly). See also the --delete-delay option that is a better choice than using --delete-after.
(from: http://linux.die.net/man/1/rsync)

inotify and rsync on large number of files

I am using inotify to watch a directory and sync files between servers using rsync. Syncing works perfectly, and memory usage is mostly not an issue. However, a large number of files were recently added (350k) and this has impacted performance, specifically CPU. Now when rsync runs, CPU usage spikes to 90-100% and rsync takes a long time to complete; there are 650k files being watched/synced.
Is there any way to speed up rsync and only rsync the directory that has been changed? Or alternatively to set up multiple inotifywaits on separate directories. Script being used is below.
UPDATE: I have added the --update flag and usage seems mostly unchanged
#!/bin/bash
EVENTS="CREATE,DELETE,MODIFY,MOVED_FROM,MOVED_TO"
inotifywait -e "$EVENTS" -m -r --format '%:e %f' /var/www/ --exclude '/var/www/.*cache.*' | (
    WAITING="";
    while true; do
        LINE="";
        read -t 1 LINE;
        if test -z "$LINE"; then
            if test ! -z "$WAITING"; then
                echo "CHANGE";
                WAITING="";
                rsync --update -alvzr --exclude '*cache*' --exclude '*.git*' /var/www/* root@secondwebserver:/var/www/
            fi;
        else
            WAITING=1;
        fi;
    done)
I ended up removing the compression option (z) and upping the WAITING delay to 10 seconds. This seems to have helped: rsync still spikes the CPU load, but the spike is shorter lived. Credit goes to an answer on Unix Stack Exchange.
You're using rsync to synchronize the root directory of a large tree, so I'm not surprised at the performance loss.
One possible solution is to only synchronize the changed files/directories, instead of the whole root directory.
For instance, suppose file1, file2 and file3 live under from/dir. When changes are made to these 3 files, use
rsync --update -alvzr from/dir/file1 from/dir/file2 from/dir/file3 to/dir
rather than
rsync --update -alvzr from/dir/* to/dir
But this has a potential pitfall: rsync won't create directories automatically if the target folders don't exist. However, you can use ssh to execute a remote command and create the directories yourself.
You may need to set up SSH public-key authentication as well, but judging by the rsync command line you pasted, I assume you've already done this.
reference:
rsync - create all missing parent directories?
rsync: how can I configure it to create target directory on server?
How to use SSH to run a shell script on a remote machine?
SSH error when executing a remote command: "stdin: is not a tty"

How to restrict Rsync update timestamp

rsync -av --size-only --include="*/" --include="*.jpeg" --exclude="*" ~/alg/temperature/ ~/alg/tmp/
I use the command above to sync some files, and I don't want anything updated, not even the timestamp, if the file size is the same.
The --size-only option makes rsync sync only files that changed in size, but files with no change in size are still "touched" and get their timestamps updated, which is what I don't want.
How can I avoid this?
The -a option is equivalent to -rlptgoD. You need to remove the -t: it tells rsync to transfer modification times along with the files and to update them on the remote system.
You may also want to try -c, which skips files based on checksum rather than mod-time & size. This is slower, but should do what you want.
So your line could be (by expanding a and replacing t with c):
rsync -rlpcgoDv --include="*/" --include="*.jpeg" --exclude="*" ~/alg/temperature/ ~/alg/tmp/
