Folderstructure with rsync in bash - rsync

I looked up the forum but didn't find an article which matches my problem. Maybe there is some, and you can help me out with it.
My problem is I want to sync an folder with the command rsync -a -v. The point is I got 5 different Maschinen. On every maschine is a scratch folder I want to sync into the folder: ~/work_dir/scratch_maschines and inside the /scratch_maschines folder should be a folder for maschine_a, maschine_b and so on.
On the maschines it is always the same path: /scratch/my_name. So when I use now this command for the first two maschines:
rsync -a -v --exclude='*.chk' --exclude='*.rwf' --exclude='*.fchk' --delete sp02:/scratch/my_name ~/work_dir/scratch_maschine01; rsync -a -v --exclude='*.chk' --exclude='*.rwf' --exclude='*.fchk' --delete maschine02:/scratch/my_name ~/work_dir/scratch_maschine02
I got a folders for scratch_maschine01 and scratch_maschine02 in my working directory but inside these folders are not direct my data there is first a folder inside with my_name and this folder contains the data. So my question is how can I use the rsync command and get the files from the scratch directorys straight to the folders for each machine?

You might want to consider reformulating your commands similar to the following:
START=`pwd`
EXCLUDES="--exclude='*.chk' --exclude='*.rwf' --exclude='*.fchk'"
{ SOURCE="sp02:/scratch/my_name"
REMOTE="${HOME}/work_dir/scratch_maschine01"
cd "${SOURCE}"
rsync --recursive -v --delete ${EXCLUDES} "./" "${REMOTE}/"
}>${START}/job.log 2>${START}/job.err
The key elements there are
the --recursive which will rsync will expand to include all content and subdirs of the SOURCE directory.
the / behind the ${SOURCE} notifies rsync to limit itself to content of the SOURCE directory, but not the directory itself.
the / behind the ${REMOTE} notifies rsync to limit itself to depositing content into that directory and expect it to already exist, to specifically fail if that does not already exist at REMOTE; this ensures that the remote site doesn't attempt a failsafe PWD and deposit files elsewhere than expected.
The above approach lends itself to a function form that could be placed into a loop with pre-attempt condition checks, along with having a complementary case for all variable assignments grouped under a destination heading (i.e. case statements).
Using such an approach with meaningful labels for variables lends itself to a type of implicit documentation, making the code more meaningful to someone not familiar with the code, as well as a refresher for yourself after a long period of not working or using the code.
I try to avoid the "~" because I prefer to always enclose definitions for variables in double quotes, to avoid issues that might arise from paths that may include unexpected characters or spaces. That way, you are sure to have your defined paths correctly interpreted by commands in scripts.
Lastly, I prefer to use the long form for the rsync options (and almost every other command) so that I don't have to refer to the manual every time to translate the single-character options when trying to understand what is coded, if the need arises for troubleshooting unexpected errors (I have always had poor memory).
My own backup command is as follows. The only reason why the
${PathMirror}${dirC}/
is not encapsulated in single quotes within the double quotes for COM is because I know those variables all evaluate to non-complex strings which cannot be misinterpreted.

Related

copy with rsync when files are different

I have to copy a big directory to my NAS using rsync, I would like to say to rsync only copy the files when source and destination are different to avoid to copy a files already copied.
Skipping identical files is the whole purpose why people use rsync. This is default behavior of rsync. Most of the time the only option you want to use is -a:
rsync -a -P <source> <dest>
The -P just means show progress and the -a means "archive" and that means "when copying files, try to make copy as identical as possible" (try to keep permissions, ownership, timestamps, etc.) but is also means "Only update files if you have to". It's like saying "make sure <dest> is an up-to-date backup of <source>".
However, by default rsync will already consider two files identical, if they have same file size and same last modification date. Of course, two files may also have same size and same last modification date and not be identical. So when running that command for the very first time and you are not sure which files may need update and which ones don't, try this:
rsync -a -c -P <source> <dest>
-c means don't rely just upon size and date, checksum every file and compare the checksums. Only if checkums are identical, consider files as identical. Note that rsync will not necessary checksum the whole file, big files are broken into smaller chunks and every chunk is checksumed separately as only chunks that have changed are transferred.
So even with checksuming you can save you a lot of time when copying over a network connection. It won't save you any time when copying locally because just copying everything is probably faster than checksuming everything. So a plain copy will always beat a checksuming rsync in speed when both, source and destination, are local drives. In that case use
cp -a -v <source> <dest>
or if your system doesn't know -a, use
cp -pPR -v <source> <dest>
that's identical to -a. Again, the -v is just to see some progress.
And I'd only use -c for the very first sync, after that, relying on file size and last modification date usually works very well for updating and it is a whole lot faster. It will work because if a file has been altered since the last sync, it will have a different last modification date and so by just comparing the dates rysnc will know that the file must be updated at the destination. Of course, that only works if your systems all have the correct date/time set and if you don't manipulate the last modification date of files and also don't forbid your system to update them.
If you want to skip files solely on presence, use this:
rsync -a -P --ignore-existing <source> <dest>
That's like telling rsync "If you see a file with the same name at the destination, always consider it to be identical and never update it".
Please note that if -a detects a file in <source> is different than a files in <dist>, whether this is determined by size and modification date or by checksumming, it will always update the file at <dest> to match then file at <source>. If multiple sources are syncing to the same destination, you might also want to add -u which means "in case two files are different, only update if the file at <source> has a newer last modification date than then file at <dest>"
Just as a general tip, if you type
man <command>
in a terminal, you will get a nice help page on most systems (Linux, MacOS X and UNIX systems), explaining you all the options in all detail. You can scroll up/down using arrow keys or page up/down and you can leave that view by hitting "q" for quit. E.g.
man rsync

Compare two directory trees

I have a btrfs-filesystem consisting of several harddrives in which is stored about 11 TB of Data. My backup consists of a NAS which exports one path via NFS. The path is then mounted on the machine with the btrfs-bilesystem and rsync is called to keep the nfs export synced to the main filesystem. I call rsync with one -v and send the results of the run to my email account to be sure everything is synchronized correctly. Now by pure chance I found out that some directories were not synchronized correctly - the directories existed on the NAS but they were empty. It is most likely not a rights issue since rsync is run as root. So it seems that in my situation rsync is not entirely trustworthy but I would like to compare the two directory trees to see if there are any files missing on the NAS and/or if there are files which dont exists on the btrfs anymore and which should have been deleted by rsync according. (I use the --delete option).
I am therefore looking for for a program or a script which can help me to check is rsync is running correctly. I don't need anything complicated like checksums, all I want to know if the NAS contains all the files in the btrfs-filesystem.
Any suggestions where to start looking?
Yours, Stefan
Run the following commands to list all files:
find /path/to/fs -type f | sort > filesystem.txt
find /path/to/nfs -type f |sort > nfs.txt
Then compare the lists:
diff -u filesystem.txt nfs.txt

mkdir's "-p" option

So this doesn't seem like a terribly complicated question I have, but it's one I can't find the answer to. I'm confused about what the -p option does in Unix. I used it for a lab assignment while creating a subdirectory and then another subdirectory within that one. It looked like this:
mkdir -p cmps012m/lab1
This is in a private directory with normal rights (rlidwka). Oh, and would someone mind giving a little explanation of what rlidwka means? I'm not a total noob to Unix, but I'm not really familiar with what this means. Hopefully that's not too vague of a question.
The man pages is the best source of information you can find... and is at your fingertips: man mkdir yields this about -p switch:
-p, --parents
no error if existing, make parent directories as needed
Use case example: Assume I want to create directories hello/goodbye but none exist:
$mkdir hello/goodbye
mkdir:cannot create directory 'hello/goodbye': No such file or directory
$mkdir -p hello/goodbye
$
-p created both, hello and goodbye
This means that the command will create all the directories necessaries to fulfill your request, not returning any error in case that directory exists.
About rlidwka, Google has a very good memory for acronyms :). My search returned this for example: http://www.cs.cmu.edu/~help/afs/afs_acls.html
Directory permissions
l (lookup)
Allows one to list the contents of a directory. It does not allow the reading of files.
i (insert)
Allows one to create new files in a directory or copy new files to a directory.
d (delete)
Allows one to remove files and sub-directories from a directory.
a (administer)
Allows one to change a directory's ACL. The owner of a directory can always change the ACL of a directory that s/he owns, along with the ACLs of any subdirectories in that directory.
File permissions
r (read)
Allows one to read the contents of file in the directory.
w (write)
Allows one to modify the contents of files in a directory and use chmod on them.
k (lock)
Allows programs to lock files in a directory.
Hence rlidwka means: All permissions on.
It's worth mentioning, as #KeithThompson pointed out in the comments, that not all Unix systems support ACL. So probably the rlidwka concept doesn't apply here.
-p|--parent will be used if you are trying to create a directory with top-down approach. That will create the parent directory then child and so on iff none exists.
-p, --parents
no error if existing, make parent directories as needed
About rlidwka it means giving full or administrative access. Found it here https://itservices.stanford.edu/service/afs/intro/permissions/unix.
mkdir [-switch] foldername
-p is a switch, which is optional. It will create a subfolder and a parent folder as well, even if parent folder doesn't exist.
From the man page:
-p, --parents no error if existing, make parent directories as needed
Example:
mkdir -p storage/framework/{sessions,views,cache}
This will create subfolder sessions,views,cache inside framework folder irrespective of whether 'framework' was available earlier or not.
PATH: Answered long ago, however, it maybe more helpful to think of -p as "Path" (easier to remember), as in this causes mkdir to create every part of the path that isn't already there.
mkdir -p /usr/bin/comm/diff/er/fence
if /usr/bin/comm already exists, it acts like:
mkdir /usr/bin/comm/diff
mkdir /usr/bin/comm/diff/er
mkdir /usr/bin/comm/diff/er/fence
As you can see, it saves you a bit of typing, and thinking, since you don't have to figure out what's already there and what isn't.
Note that -p is an argument to the mkdir command specifically, not the whole of Unix. Every command can have whatever arguments it needs.
In this case it means "parents", meaning mkdir will create a directory and any parents that don't already exist.

Making multiple files from multiple files with one command in gnu make

Assume 1000 files with extension .xhtml are in directory input, and that a certain subset of those files (with output paths in $(FILES), say) need to be transformed via xslt to files with the same name in directory output. A simple make rule would be:
$(FILES): output/%.xhtml : input/%.xhtml
saxon s:$< o:$# foo.xslt
This works, of course, doing the transform one file at a time. The problem is that I want to use saxon's batch processing to do all the files at one time, since, given the number of files, that would be much faster, considering the overhead of loading java and saxon for each file. Saxon allows the -s (source) option to be a directory and processes all files in that directory, placing the results with the same name in the directory specified in the -o: option.
I'm aware of the well-known technique to get GNU make to do a single command to update multiple files by using pattern rules:
output/%.xhtml: input/%.xhtml
saxon s:input -o:output foo.xslt
But in my case this suffers from two problems. First, it will run the transform on all files in the input directory, not just the ones that have changed; and second, it will not limit the transform to the subset of files specified in $(FILES). The GNU make feature of running a recipe given in a pattern rule only once for all matched targets does not work in the case of so-called "static pattern rules" (see [here]), as the rule given at the top of the post is known.
In order to use the saxon batching feature, I need to create a temporary directory, copy to it only those files to be processed, then run the transform with that temporary directory as the input directory. I tried creating a temporary directory, and remember its name using a target-specific variable for future use, using
$(FILES): TMPDIR:=$(shell mktemp -d)
but this creates a new temporary directory for every single target that is out-of-date. In any case, I'm not sure how to structure the rule that would then copy the necessary files into that directory. I don't want to create the temporary directory at the time the makefile is parsed, since I have a non-recursive make system that will parse all make files, even those not related to the current top-level target, and don't want to create the temporary directory for situations in which it is not necessary/will not be used.
I'm well aware that many questions have been asked on SO in the past about creating multiple files from a single input; one solution is (non-static) pattern rules; other solutions involve phony targets. However, in this case I'm stuck as to how to put all this together.
I can identify the files that changed and copy them using the static pattern rule
$(FILES): output/%.xhtml : input/%.xhtml
TMPDIR=`mktemp -d`
cp $< $(TMPDIR)
but actually I would prefer to copy the files with a single cp command, whereas this copies them one by one. Perhaps there is some application here of cp -u?
I also considered using an ad-hoc extension for those files needing updating but could not see how to get this to work either. I'm about to give up and just run the saxon transform on all files when any of them have changed, but is there any better way?
Personally, I wouldn't try to do this from the command line. That's partly because I'm not a shell scripting wizard. I'm not an Ant wizard either, but because the requirement is to process files that haven't changed, this seems to fall very much into Ant territory. On the other hand, Ant will recompile the stylesheet for each transformation, which is an overhead you might want to avoid; if that's the case then your best bet is probably to write a little Java application. It's probably only 100 lines or less.
Final possibility is to do the processing within Saxon: that is, a single transformation that reads multiple input files using the collection() function and generates multiple result files using xsl:result-document. Saxon (commercial editions) offers an extension function last-modified that allows you to filter the files to be processed. With 1000 files you might also want the extension function saxon:discard-document() to prevent the heap filling.
Personally, I like your original one-compiler-per-file formulation. Does not this work well with make's -j n flag?
You can of course batch up files by copying, and then running saxon at the end. Recursive make (ugh!) can sort out the ordering. Something like:
.PHONY: all
all:
rm -rf tmpdir
${MAKE} tmpdir/sentinel
saxon -s:tmpdir -o:output foo.xslt
tmpdir/sentinel: $(FILES) ; touch $#
$(FILES): output/%.xhtml: input/%.xhtml
ln $< $(patsubst input/%,tmpdir/%,$<)
This does work, though I am very queasy about lying to make (the static pattern rule purports to create the target in output/, but in fact does its dirty deed in tmpdir/).
Note in the recipe for tmpdir/sentinel, that $? is correctly set to the list of output files that are out of date. This might be useful if you can pass a bunch of files to saxon rather than a folder.
I think one issue here is that 'saxon' supports either one file or all files in a directory, so isn't suitable for batch processing without copying to temporary directories.
Otherwise, this is quite simple to do by using a timestamp marker file as a proxy target. For example:
output/.timestamp : $(FILES)
mkdir -p $(#D)
$(COMMAND) -outputdir=output $?
touch $#
The three commands are:
Ensure that the output directory exists.
Run the batch command on files newer than the timestamp file.
Update the timestamp file (creating it if necessary).
Remembering that each line of a command is executed in its own subshell, and that if any command line fails, then subsequent lines are not invoked.
This approach is useful with Java builds.

Rsync previous half-copied files?

I found rsync behaves differently in the following two situations:
(1) All the files are copied by using rsync, then using rsync again will be fast (skip all the files);
(2) Use cp to copy files, then using rsync will be slow (or may be run freshly?)
So my confusion is "Does rsync generate any internal things on the files so that it can refer to avoid duplicate checking?"
rsync -a (in archive mode, which I presume you ran) retains all attributes of a file, including creation/modification time. cp does not. I suppose something in the file attributes that's different when you use cp, probably a later modification time, in the destination files, made rsync think they are newer files, so it either recopied them or had to check the contents.

Resources