I am trying to copy a directory full of directories and small files to a new server for an app migration. rsync is always my go-to tool for this type of migration, but this time it is not working as expected.
The directory has 174,412 files and is 136G in size, so I created a 256G disk for it on the new server.
The issue is that when I rsync'd the files over to the new server, the new partition ran out of space before all of the files were copied.
I did some tests with a bigger destination disk on my test machine, and when the copy finishes, the total size on the new disk is 272G:
time sudo rsync -avh /mnt/dotcms/* /data2/
sent 291.61G bytes received 2.85M bytes 51.75M bytes/sec
total size is 291.52G speedup is 1.00
df -h /data2
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/data2vg-data2lv 425G 272G 154G 64% /data2
The source is on a NAS and the new target is an XFS file system, so at first I thought it might be a block size issue. But then I used the cp command and it used exactly the same amount of space as the source:
time sudo cp -av /mnt/dotcms/* /data2/
df -h /data2
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/data2vg-data2lv 425G 136G 290G 32% /data2
Why is rsync increasing the space used?
According to the documentation, dotcms makes use of hard links, so you need to give rsync the -H option to preserve them. Without -H, rsync copies each hard-linked name as a separate file, which is why the destination here ends up roughly twice the size of the source. Note that GNU cp -a preserves hard links, so it doesn't have this problem.
Other rsync options you should consider using include:
-H, --hard-links : preserve hard links
-A, --acls : preserve ACLs (implies --perms)
-X, --xattrs : preserve extended attributes
-S, --sparse : turn sequences of nulls into sparse blocks
--delete : delete extraneous files from destination dirs
This assumes you are running rsync as root and that the destination is supposed to have the same users/groups as the source. If the users and groups are not the same, then Cyrus's alternative command line using --numeric-ids may be more appropriate.
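For example, a repeat of the copy with hard links and the other metadata preserved might look like this (a sketch using the paths from the question; drop any flags you don't need):
# -a archive, -H hard links, -A ACLs, -X xattrs, -S sparse files, -vh verbose with human-readable sizes
# A trailing slash on the source copies its contents (including dot files), unlike /mnt/dotcms/*
sudo rsync -aHAXSvh /mnt/dotcms/ /data2/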
I'm trying to sync directories on the same machine, basically copying files from one directory to another directory.
Under certain circumstances, the write permission on the destination files is removed to protect them. However, rsync seems to ignore the missing write permission and overwrites all of the files in the destination anyway. Any idea why?
Commands used (all have the same problem):
$ rsync -azv --delete source/ destination/
$ rsync -azv source/ destination/
Version:
rsync version 2.6.9 protocol version 29
Destination file permissions: -r--r--r--
Source file permissions: -rwxrwxrwx
Destination file owner: same owner (not root, though)
Output:
building file list ... done
sent 101 bytes received 26 bytes 254.00 bytes/sec
total size is 1412 speedup is 11.12
resulting destination file: -rwxrwxrwx
OS:
both macOS (latest) and Red Hat Linux
I need to transfer files from a remote Linux server directly to HDFS.
I have a keytab placed on the remote server, and after running kinit it is activated; however, I cannot browse the HDFS folders. I know that from the edge nodes I can copy files directly to HDFS, but I need to skip the edge node and transfer the files straight to HDFS.
How can we achieve this?
Let's assume a couple of things first. You have one machine on which the external hard drive is mounted (named DISK) and one cluster of machines with ssh access to the master (we denote by master in the command line the user@hostname part of the master machine). You run the script on the machine with the drive. The data on the drive consists of multiple directories with multiple files in each (around 100); the numbers don't matter, they are just there to justify the loops. The path to the data is stored in the ${DIR} variable (on Linux it would be /media/DISK and on Mac OS X /Volumes/DISK). Here is what the script looks like:
DIR=/Volumes/DISK
# Loop over every directory on the drive, then over every file within it
for d in $(ls "${DIR}/"); do
  for f in $(ls "${DIR}/${d}/"); do
    # Stream each file over ssh and write it straight into HDFS from stdin
    cat "${DIR}/${d}/${f}" | ssh master "hadoop fs -put - /path/on/hdfs/${d}/${f}"
  done
done
Note that we go over each file and we copy it into a specific file because the HDFS API for put requires that "when source is stdin, destination must be a file."
Unfortunately, it takes forever. When I came back the next morning, it had only done a fifth of the data (100GB) and was still running... basically taking 20 minutes per directory! I ended up going with the solution of copying the data temporarily onto one of the machines in the cluster and then copying it locally into HDFS. For space reasons, I did it one folder at a time, deleting the temporary folder immediately afterwards. Here is what that script looks like:
DIR=/Volumes/DISK
PTH=/path/on/one/machine/of/the/cluster
for d in $(ls "${DIR}/"); do
  # Copy the folder to a cluster node, load it into HDFS, then remove the temporary copy
  scp -r -q "${DIR}/${d}" "master:${PTH}/"
  ssh master "hadoop fs -copyFromLocal ${PTH}/${d} /path/on/hdfs/"
  ssh master "rm -rf ${PTH}/${d}"
done
Hope it helps!
I need to move 90 million files from an NFS folder to a second NFS folder; both connections to the NFS servers use the same eth0, which is 1Gb/s. Sync is not needed, only a move (overwrite if it exists). I think my main problem is the number of files, not the total size, so the best approach should be the one with the fewest system calls per file against the NFS folders.
I tried cp, rsync, and finally parsync (http://moo.nac.uci.edu/~hjm/parsync/). parsync first took 10 hours to generate a 12 GB gzip of the file list; after another 40 hours not a single file had been copied. It was running 10 threads until I canceled it and started debugging, and with the -vvv option I found it is making another call (stat?) for each file from the list (it uses rsync underneath):
[sender] make_file(accounts/hostingfacil/snap.2017-01-07.041721/hostingfacil/homedir/public_html/members/vendor/composer/62ebc48e/vendor/whmcs/whmcs-foundation/lib/Domains/DomainLookup/Provider.php,*,0)*
the parsync command is:
time parsync --rsyncopts="-v -v -v" --reusecache --NP=10 --startdir=/nfsbackup/folder1/subfolder2 thefolder /nfsbackup2/folder1/subfolder2
Each rsync has this form:
rsync --bwlimit=1000000 -v -v -v -a --files-from=/root/.parsync/kds-chunk-9 /nfsbackup/folder1/subfolder2 /nfsbackup2/folder1/subfolder2
The NFS folders are mounted:
server:/export/folder/folder /nfsbackup2 nfs auto,noexec,noatime,nolock,bg,intr,tcp,actimeo=1800,nfsvers=3,vers=3 0 0
Any idea how to instruct rsync to copy the files already in the list from the NFS folder to the second NFS folder? Or any other way to make this copy efficient (one system call per file)?
I had issues doing the same thing once, and I found that it's best to just run a find command and move each file individually.
cd /origin/path
find . | cpio -updm ../destination/
The -p flag puts cpio in pass-through (copy) mode, -u overwrites existing files, -d creates directories as needed, and -m preserves modification times.
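Applied to the paths from the question, that would look roughly like this (just a sketch; try it on a small subtree first):
# Walk the source tree on the first NFS mount and pass-copy it onto the second,
# creating directories as needed and preserving modification times
cd /nfsbackup/folder1/subfolder2
find . | cpio -pdum /nfsbackup2/folder1/subfolder2/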
I have about 50 or so files in various sub-directories that I'd like to push to a remote server. I figured rsync would be able to do this for me using the --include-from option. Without the --exclude="*" option, all the files in the directory are synced; with the option, no files are.
rsync -avP -e ssh --include-from=deploy/rsync_include.txt --exclude=* ./ root@0.0.0.0:/var/www/ --dry-run
I'm running it as a dry run initially, and 0.0.0.0 is obviously replaced by the IP of the remote server. The contents of rsync_include.txt is a newline-separated list of relative paths to the files I want to upload.
Is there a better way of doing this that is escaping me on a Monday morning?
There is a flag --files-from that does exactly what you want. From man rsync:
--files-from=FILE
Using this option allows you to specify the exact list of files to transfer (as read from the specified FILE or - for standard input). It also tweaks the default behavior of rsync to make transferring just the specified files and directories easier:
The --relative (-R) option is implied, which preserves the path information that is specified for each item in the file (use --no-relative or --no-R if you want to turn that off).
The --dirs (-d) option is implied, which will create directories specified in the list on the destination rather than noisily skipping them (use --no-dirs or --no-d if you want to turn that off).
The --archive (-a) option’s behavior does not imply --recursive (-r), so specify it explicitly, if you want it.
These side-effects change the default state of rsync, so the position of the --files-from option on the command-line has no bearing on how other options are parsed (e.g. -a works the same before or after --files-from, as does --no-R and all other options).
The filenames that are read from the FILE are all relative to the source dir -- any leading slashes are removed and no ".." references are allowed to go higher than the source dir. For example, take this command:
rsync -a --files-from=/tmp/foo /usr remote:/backup
If /tmp/foo contains the string "bin" (or even "/bin"), the /usr/bin directory will be created as /backup/bin on the remote host. If it contains "bin/" (note the trailing slash), the immediate contents of the directory would also be sent (without needing to be explicitly mentioned in the file -- this began in version 2.6.4). In both
cases, if the -r option was enabled, that dir’s entire hierarchy would also be transferred (keep in mind that -r needs to be specified explicitly with --files-from, since it is not implied by -a). Also note that the effect of the (enabled by default) --relative option is to duplicate only the path info that is read from the file -- it
does not force the duplication of the source-spec path (/usr in this case).
In addition, the --files-from file can be read from the remote host instead of the local host if you specify a "host:" in front of the file (the host must match one end of the transfer). As a short-cut, you can specify just a prefix of ":" to mean "use the remote end of the transfer". For example:
rsync -a --files-from=:/path/file-list src:/ /tmp/copy
This would copy all the files specified in the /path/file-list file that was located on the remote "src" host.
If the --iconv and --protect-args options are specified and the --files-from filenames are being sent from one host to another, the filenames will be translated from the sending host’s charset to the receiving host’s charset.
NOTE: sorting the list of files in the --files-from input helps rsync to be more efficient, as it will avoid re-visiting the path elements that are shared between adjacent entries. If the input is not sorted, some path elements (implied directories) may end up being scanned multiple times, and rsync will eventually unduplicate them after
they get turned into file-list elements.
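Applied to the command in the question, that becomes something like the following (host and paths taken from the question; keep --dry-run until the file list looks right):
rsync -avP -e ssh --files-from=deploy/rsync_include.txt ./ root@0.0.0.0:/var/www/ --dry-run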
For the record, none of the answers above helped except for one. To summarize, you can do the backup operation with either:
rsync -aSvuc `cat rsync-src-files` /mnt/d/rsync_test/
OR
rsync -aSvuc --recursive --files-from=rsync-src-files . /mnt/d/rsync_test/
The former command is self-explanatory, apart from the contents of the file rsync-src-files, which I will elaborate on below. Now, if you want to use the latter version, you need to keep the following four remarks in mind:
Notice that one needs to specify both --files-from and the source directory.
One needs to explicitly specify --recursive.
The file rsync-src-files is a user-created file; it was placed within the source directory for this test.
rsync-src-files contains the files and folders to copy, taken relative to the source directory. IMPORTANT: make sure there are no trailing spaces or blank lines in the file. In the example below there are only two lines, not three (I figured that out by chance). The content of rsync-src-files is:
folderName1
folderName2
The --files-from= parameter needs a trailing slash if you want to keep the absolute path intact. So your command would become something like this:
rsync -av --files-from=/path/to/file / /tmp/
This is useful when there are a large number of files and you want to copy them all to some path x. So you would find the files and write the output to a file, like below:
find /var/* -name '*.log' > file
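Putting the two steps together (the paths are only illustrative):
# Collect the log files, then copy them to /tmp/ while keeping their path structure
find /var/* -name '*.log' > file
rsync -av --files-from=file / /tmp/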
$ date
Wed 24 Apr 2019 09:54:53 AM PDT
$ rsync --version
rsync version 3.1.3 protocol version 31
...
Syntax: rsync <args> <file_and_or_folder_list> <source_dir> <destination_dir/>
Folder names - WITH a trailing /; e.g. Cancer - Evolution/ - are provided in a file (e.g. my_folder_list):
# comment: /mnt/Vancouver/my_folder_list
# comment: 2019-04-24
some_file
another_file
Cancer/
Cancer - Evolution/
Cancer - Genomic Variants/
Cancer - Metastasis (EMT Transition ...)/
Cancer Pathways, Networks/
Catabolism - Autophagy; Phagosomes; Mitophagy/
So those are the "source" files and/or folders to be rsync'd.
Note that if you don't include the trailing / shown above, rsync creates the target folders, but they are empty.
Those folder names provided in the <file_and_or_folder_list> are appended to the rest of their path: <src_dir> = /home/victoria/RESEARCH - NEWS (here, on a different partition), thus providing the complete folder path to rsync; e.g.: ... /home/victoria/RESEARCH - NEWS/Cancer - Evolution/ ...
[ I'm editing this answer some time later (2022-07), and I can't recall if the path provided to <src_dir> is /home/victoria/RESEARCH - NEWS or /home/victoria/RESEARCH - NEWS/ - providing the correct concatenated path. I believe it's the former; if it doesn't work, use the latter. ]
Note that you also need to use --files-from= ..., NOT --include-from= ...
Again the rsync syntax is:
rsync <args> <file_and_or_folder_list> <source_dir> <destination_dir/>
so,
rsync -aqP --delete --files-from=/mnt/Vancouver/my_folder_list "/home/victoria/RESEARCH - NEWS" $DEST_DIR/
where
<args> is -aqP --delete
<file_and_or_folder_list> is --files-from=/mnt/Vancouver/my_folder_list
<source_dir> is "/home/victoria/RESEARCH - NEWS"
<destination_dir/> is $DEST_DIR/ (note the trailing / added to the variable name)
In my BASH script, for coding flexibility I defined variable $DEST_DIR in two parts as follows.
BASEDIR="/mnt/Vancouver"
DEST_DIR=$BASEDIR/data
echo $DEST_DIR ## /mnt/Vancouver/data
## To clarify, here is $DEST_DIR with / appended to the variable name:
echo $DEST_DIR/ ## /mnt/Vancouver/data/
echo $DEST_DIR/apple/banana ## /mnt/Vancouver/data/apple/banana
However, you can more simply specify the destination path:
via a BASH variable: DEST_DIR=/mnt/Vancouver/data
note that in the rsync expression above, / is appended to $DEST_DIR (i.e. $DEST_DIR/ is actually $DEST_DIR + /), giving the destination directory path /mnt/Vancouver/data/
explicitly state the destination path: /mnt/Vancouver/data/ (see the full command below)
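With the destination written out explicitly, the command from above becomes:
rsync -aqP --delete --files-from=/mnt/Vancouver/my_folder_list "/home/victoria/RESEARCH - NEWS" /mnt/Vancouver/data/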
rsync options used: ## man rsync or rsync -h
-a : archive: equals -rlptgoD (no -H,-A,-X)
-r : recursive
-l : copy symlinks as symlinks
-p : preserve permissions
-t : preserve modification times
-g : preserve group
-o : preserve owner (super-user only)
-D : same as --devices --specials
-P : same as --partial --progress
-q : quiet (https://serverfault.com/questions/547106/run-totally-silent-rsync)
--delete
This tells rsync to delete extraneous files from the RECEIVING SIDE (ones
that AREN’T ON THE SENDING SIDE), but only for the directories that are
being synchronized. You must have asked rsync to send the whole directory
(e.g. "dir" or "dir/") without using a wildcard for the directory’s contents
(e.g. "dir/*") since the wildcard is expanded by the shell and rsync thus
gets a request to transfer individual files, not the files’ parent directory.
Files that are excluded from the transfer are also excluded from being
deleted unless you use the --delete-excluded option or mark the rules as
only matching on the sending side (see the include/exclude modifiers in the
FILTER RULES section). ...
Edit: atp's answer below is better. Please use that one!
You might have an easier time, if you're looking for a specific list of files, putting them directly on the command line instead:
# rsync -avP -e ssh `cat deploy/rsync_include.txt` root@0.0.0.0:/var/www/
This is assuming, however, that your list isn't so long that the command line length will be a problem and that the rsync_include.txt file contains just real paths (i.e. no comments, and no regexps).
None of these answers worked for me when all I had was a list of directories. Then I stumbled upon the solution: you have to add -r alongside --files-from, because -a is not recursive in this scenario (who knew?!).
rsync -aruRP --files-from=directory.list . ../new/location
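For example, with directory.list containing entries like these (the names are just illustrative, relative to the source directory .):
photos/2019
photos/2020
documents
With -r, rsync descends into each listed directory; without it, only the directories themselves (or their immediate contents, if you add a trailing slash) are transferred, as described in the man page excerpt above.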
I got a similar task: to rsync all files modified after a given date, but excluding some directories. It was difficult to build an all-in-one one-liner, so I divided the problem into smaller pieces.
Final solution:
find ~/sourceDIR -type f -newermt "DD MMM YYYY HH:MM:SS" | egrep -v "/\..|Downloads|FOO" > FileList.txt
rsync -v --files-from=FileList.txt ~/sourceDIR /Destination
First I use find -L ~/sourceDIR -type f -newermt "DD MMM YYYY HH:MM:SS". I tried to add a regex to the find line to exclude name patterns; however, my flavor of Linux (Mint) seems not to understand negated regexes in find. I tried a number of regex flavors; none worked as desired.
So I ended up with egrep -v, an option that excludes patterns the easy way. As a result, rsync does not copy directories like /.cache or /.config, plus some others I explicitly named.
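One thing to watch: find prints paths starting with the expanded ~/sourceDIR, while --files-from entries are interpreted relative to the source directory given to rsync (see the man page excerpt earlier on this page). If rsync complains that the listed files don't exist, generating the list with relative paths should line things up; a sketch (same exclusions, with the list written outside the tree so it doesn't list itself):
cd ~/sourceDIR
find . -type f -newermt "DD MMM YYYY HH:MM:SS" | egrep -v "/\..|Downloads|FOO" > /tmp/FileList.txt
rsync -v --files-from=/tmp/FileList.txt ~/sourceDIR /Destination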
This is not a direct answer to the question, but it should help you figure out which solution best fits your problem.
When analysing the problem, you should activate the debug option -vv.
Then rsync will output which files are included or excluded by which pattern:
building file list ...
[sender] hiding file FILE1 because of pattern FILE1*
[sender] showing file FILE2 because of pattern *
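For example, combined with a dry run so nothing is actually transferred while you inspect the filter decisions (command adapted from the question; substitute your own include file and host):
rsync -avvn -e ssh --include-from=deploy/rsync_include.txt --exclude='*' ./ root@0.0.0.0:/var/www/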