How to sync files matching glob pattern like `foo*/bar*/yaz*.gz` - rsync

I'm trying to sync a set of files that match the shell glob foo*/bar*/yaz*.gz.
The standard rsync trick of --include='yaz*.gz' --include='*/' --exclude='*' --prune-empty-dirs does work, but it is quite slow since there are many other uninteresting directories and files that it finds.
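One way to avoid scanning the uninteresting directories is to let the shell expand the glob and hand the result to rsync with --relative, which keeps each file's foo*/bar*/ path under the destination. The sketch below assumes the source tree is local and that the list of matches fits on one command line. The function name sync_glob and its two arguments are hypothetical.

```shell
#!/bin/sh
# sync_glob SRC_ROOT DEST: copy SRC_ROOT/foo*/bar*/yaz*.gz into DEST,
# keeping the foo*/bar*/ directory structure. Names are hypothetical.
sync_glob() (
    src_root=$1
    mkdir -p "$2" || exit 1
    dest=$(cd "$2" && pwd) || exit 1   # absolute path, because we cd below
    cd "$src_root" || exit 1
    set -- foo*/bar*/yaz*.gz           # the shell expands the glob, not rsync
    [ -e "$1" ] || exit 0              # nothing matched: the pattern stayed literal
    # --relative recreates each file's path under DEST instead of flattening it
    rsync --archive --relative "$@" "$dest/"
)
```

Called as sync_glob /path/to/source /path/to/dest, rsync only ever sees the matching files. If the match list is too long for one command line, rsync's --files-from option can read the names from a file or standard input instead.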

Related

Folderstructure with rsync in bash

I searched the forum but didn't find a post that matches my problem. Maybe there is one and you can point me to it.
My problem is that I want to sync a folder with the command rsync -a -v. I have 5 different machines. Each machine has a scratch folder that I want to sync into the folder ~/work_dir/scratch_maschines, and inside the scratch_maschines folder there should be one folder per machine: maschine_a, maschine_b and so on.
On the machines the path is always the same: /scratch/my_name. When I run this command for the first two machines:
rsync -a -v --exclude='*.chk' --exclude='*.rwf' --exclude='*.fchk' --delete sp02:/scratch/my_name ~/work_dir/scratch_maschine01; rsync -a -v --exclude='*.chk' --exclude='*.rwf' --exclude='*.fchk' --delete maschine02:/scratch/my_name ~/work_dir/scratch_maschine02
I get folders scratch_maschine01 and scratch_maschine02 in my working directory, but my data is not directly inside them: each one first contains a folder my_name, and that folder contains the data. So my question is: how can I use the rsync command to get the files from the scratch directories straight into the folder for each machine?
You might want to consider reformulating your commands similar to the following:
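START=$(pwd)
# An array keeps each --exclude option as one word without embedding literal quotes.
EXCLUDES=(--exclude='*.chk' --exclude='*.rwf' --exclude='*.fchk')
{ SOURCE="sp02:/scratch/my_name"
REMOTE="${HOME}/work_dir/scratch_maschine01"   # the local destination, despite the name
# SOURCE is on another host, so it cannot be cd'ed into; the trailing / does the job.
rsync --recursive -v --delete "${EXCLUDES[@]}" "${SOURCE}/" "${REMOTE}/"
} >"${START}/job.log" 2>"${START}/job.err"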
The key elements there are:
the --recursive option, which makes rsync descend into the SOURCE directory and include all of its content and subdirectories.
the / behind ${SOURCE}, which tells rsync to copy the content of the SOURCE directory, but not the directory itself.
the / behind ${REMOTE}, which marks the destination as a directory to deposit content into. It is safest to create it beforehand: rsync creates a missing final path component on its own, but fails if the parent directories are missing.
The above approach lends itself to a function form that could be placed into a loop with pre-attempt condition checks, along with having a complementary case for all variable assignments grouped under a destination heading (i.e. case statements).
Using such an approach with meaningful labels for variables lends itself to a type of implicit documentation, making the code more meaningful to someone not familiar with the code, as well as a refresher for yourself after a long period of not working or using the code.
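A minimal sketch of that function-in-a-loop form follows. The name sync_scratch, the RSYNC override variable and the host names in the usage comment are assumptions made for illustration. The excludes and the /scratch/my_name path are taken from the question.

```shell
#!/bin/sh
# sync_scratch HOST DEST: mirror HOST:/scratch/my_name/ into DEST.
# Setting RSYNC=echo (a hypothetical override) prints the command instead of running it.
sync_scratch() (
    host=$1
    dest=$2
    mkdir -p "$dest" || exit 1
    # The trailing / on the source copies the directory's content, not the directory itself.
    "${RSYNC:-rsync}" --archive --verbose --delete \
        --exclude='*.chk' --exclude='*.rwf' --exclude='*.fchk' \
        "${host}:/scratch/my_name/" "${dest}/"
)

# Usage (host names are placeholders):
# for host in maschine_a maschine_b; do
#     sync_scratch "$host" "${HOME}/work_dir/scratch_maschines/${host}" \
#         || echo "sync failed for ${host}" >&2
# done
```

Keeping the per-host work in one function means a failed host is reported without stopping the loop, and the destination folder name is derived from the host name rather than typed twice.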
I try to avoid the "~" because I prefer to always enclose definitions for variables in double quotes, to avoid issues that might arise from paths that may include unexpected characters or spaces. That way, you are sure to have your defined paths correctly interpreted by commands in scripts.
Lastly, I prefer to use the long form for the rsync options (and almost every other command) so that I don't have to refer to the manual every time to translate the single-character options when trying to understand what is coded, if the need arises for troubleshooting unexpected errors (I have always had poor memory).
My own backup command is as follows. The only reason why the
${PathMirror}${dirC}/
is not encapsulated in single quotes within the double quotes for COM is because I know those variables all evaluate to non-complex strings which cannot be misinterpreted.

Compare two directory trees

I have a btrfs filesystem spanning several hard drives, which stores about 11 TB of data. My backup is a NAS that exports one path via NFS. The path is mounted on the machine with the btrfs filesystem, and rsync is called to keep the NFS export in sync with the main filesystem. I call rsync with one -v and send the output of the run to my email account to be sure everything is synchronized correctly. Now by pure chance I found out that some directories were not synchronized correctly: the directories existed on the NAS but they were empty. It is most likely not a permissions issue, since rsync is run as root. So it seems that in my situation rsync is not entirely trustworthy, and I would like to compare the two directory trees to see whether any files are missing on the NAS and/or whether there are files that no longer exist on the btrfs filesystem and should have been deleted by rsync (I use the --delete option).
I am therefore looking for a program or a script that can help me check whether rsync is running correctly. I don't need anything complicated like checksums; all I want to know is whether the NAS contains all the files in the btrfs filesystem.
Any suggestions where to start looking?
Yours, Stefan
Run the following commands to list all files, with paths relative to each tree so that the two lists are comparable:
(cd /path/to/fs && find . -type f | sort) > filesystem.txt
(cd /path/to/nfs && find . -type f | sort) > nfs.txt
Then compare the lists:
diff -u filesystem.txt nfs.txt
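If a labelled report is easier to read than a unified diff, the same two lists can be fed to comm. This is a sketch under the assumption that file names contain no newlines. The function names list_files and compare_trees are hypothetical.

```shell
#!/bin/sh
# list_files DIR: print every regular file under DIR, relative to DIR, sorted.
list_files() (
    cd "$1" && find . -type f | LC_ALL=C sort
)

# compare_trees SRC DST: report files that exist on only one side.
compare_trees() (
    tmp=$(mktemp -d) || exit 1
    list_files "$1" > "$tmp/src.txt" || exit 1
    list_files "$2" > "$tmp/dst.txt" || exit 1
    # comm -23 keeps lines unique to the first list, comm -13 lines unique to the second.
    LC_ALL=C comm -23 "$tmp/src.txt" "$tmp/dst.txt" | sed 's|^|missing in destination: |'
    LC_ALL=C comm -13 "$tmp/src.txt" "$tmp/dst.txt" | sed 's|^|only in destination: |'
    rm -rf "$tmp"
)
```

For the setup in the question this would be called as compare_trees /path/to/fs /path/to/nfs. Only names are compared, not contents, which matches the "no checksums" requirement.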

synchronise local directories over ssh

The following command works great for me for a single file:
scp your_username@remotehost.edu:foobar.txt /some/local/directory
What I want is to do this recursively (i.e. for all subdirectories and files under a given path on the server), merge folders and overwrite files that already exist locally, and download only those files on the server that are smaller than a certain size (e.g. 10 MB).
How could I do that?
Use rsync.
Your command is likely to look like this:
rsync -az --max-size=10m your_username@remotehost.edu:foobar.txt /some/local/directory
-a (archive mode - the sync is recursive, transfers ownership, attributes, symlinks among other things)
-z (compresses transfer)
--max-size (only copies files up to a certain size)
There are many more flags which may be suitable. Check out the docs for more details - http://linux.die.net/man/1/rsync
First option: use rsync.
Second option, and it's not going to be a one liner, but can be done in three or four lines:
Create a tar archive on the remote system using ssh.
Copy the tar from remote system with scp.
Untar the archive locally.
If the creation of the archive gets a bit complicated and involves using find and/or tar with several options it is quite practical to create a script which would do that locally, upload it on the server with scp, and only then execute remotely with ssh.
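The archive-creation step could look like the sketch below, which is the kind of script one might upload and run on the server. It assumes a tar that accepts --null and -T (GNU tar and bsdtar do) and a find that supports -print0. The function name archive_small_files is hypothetical.

```shell
#!/bin/sh
# archive_small_files DIR ARCHIVE: pack the regular files under DIR that are
# smaller than about 10 MiB into ARCHIVE (a gzip-compressed tar file).
archive_small_files() (
    mkdir -p "$(dirname "$2")" || exit 1
    archive=$(cd "$(dirname "$2")" && pwd)/$(basename "$2") || exit 1
    cd "$1" || exit 1
    # find counts -size in 512-byte blocks by default: 20480 blocks = 10 MiB
    find . -type f -size -20480 -print0 | tar --null -T - -czf "$archive"
)

# Afterwards, on the local machine (host and paths are placeholders):
# scp your_username@remotehost.edu:small.tar.gz /tmp/
# tar -xzf /tmp/small.tar.gz -C /some/local/directory
```

Extracting with tar -xzf into an existing directory merges folders and overwrites files that already exist locally, which is the behaviour asked for in the question.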

Making multiple files from multiple files with one command in gnu make

Assume 1000 files with extension .xhtml are in directory input, and that a certain subset of those files (with output paths in $(FILES), say) need to be transformed via xslt to files with the same name in directory output. A simple make rule would be:
$(FILES): output/%.xhtml : input/%.xhtml
	saxon -s:$< -o:$@ foo.xslt
This works, of course, doing the transform one file at a time. The problem is that I want to use saxon's batch processing to do all the files at one time, since, given the number of files, that would be much faster, considering the overhead of loading java and saxon for each file. Saxon allows the -s (source) option to be a directory and processes all files in that directory, placing the results with the same name in the directory specified in the -o: option.
I'm aware of the well-known technique to get GNU make to do a single command to update multiple files by using pattern rules:
output/%.xhtml: input/%.xhtml
	saxon -s:input -o:output foo.xslt
But in my case this suffers from two problems. First, it will run the transform on all files in the input directory, not just the ones that have changed; and second, it will not limit the transform to the subset of files specified in $(FILES). The GNU make feature of running a recipe given in a pattern rule only once for all matched targets does not work in the case of so-called "static pattern rules" (see [here]), as the rule given at the top of the post is known.
In order to use the saxon batching feature, I need to create a temporary directory, copy to it only those files to be processed, then run the transform with that temporary directory as the input directory. I tried creating a temporary directory, and remember its name using a target-specific variable for future use, using
$(FILES): TMPDIR:=$(shell mktemp -d)
but this creates a new temporary directory for every single target that is out-of-date. In any case, I'm not sure how to structure the rule that would then copy the necessary files into that directory. I don't want to create the temporary directory at the time the makefile is parsed, since I have a non-recursive make system that will parse all make files, even those not related to the current top-level target, and don't want to create the temporary directory for situations in which it is not necessary/will not be used.
I'm well aware that many questions have been asked on SO in the past about creating multiple files from a single input; one solution is (non-static) pattern rules; other solutions involve phony targets. However, in this case I'm stuck as to how to put all this together.
I can identify the files that changed and copy them using the static pattern rule
$(FILES): output/%.xhtml : input/%.xhtml
	TMPDIR=`mktemp -d`
	cp $< $(TMPDIR)
but actually I would prefer to copy the files with a single cp command, whereas this copies them one by one. Perhaps there is some application here of cp -u?
I also considered using an ad-hoc extension for those files needing updating but could not see how to get this to work either. I'm about to give up and just run the saxon transform on all files when any of them have changed, but is there any better way?
Personally, I wouldn't try to do this from the command line. That's partly because I'm not a shell scripting wizard. I'm not an Ant wizard either, but because the requirement is to process only files that have changed, this seems to fall very much into Ant territory. On the other hand, Ant will recompile the stylesheet for each transformation, which is an overhead you might want to avoid; if that's the case then your best bet is probably to write a little Java application. It's probably 100 lines or fewer.
A final possibility is to do the processing within Saxon: that is, a single transformation that reads multiple input files using the collection() function and generates multiple result files using xsl:result-document. Saxon (commercial editions) offers an extension function last-modified that allows you to filter the files to be processed. With 1000 files you might also want the extension function saxon:discard-document() to prevent the heap filling.
Personally, I like your original one-compiler-per-file formulation. Doesn't this work well with make's -j n flag?
You can of course batch up files by copying, and then running saxon at the end. Recursive make (ugh!) can sort out the ordering. Something like:
.PHONY: all
all:
	rm -rf tmpdir
	mkdir tmpdir
	${MAKE} tmpdir/sentinel
	saxon -s:tmpdir -o:output foo.xslt
tmpdir/sentinel: $(FILES) ; touch $@
$(FILES): output/%.xhtml: input/%.xhtml
	ln $< $(patsubst input/%,tmpdir/%,$<)
This does work, though I am very queasy about lying to make (the static pattern rule purports to create the target in output/, but in fact does its dirty deed in tmpdir/).
Note in the recipe for tmpdir/sentinel, that $? is correctly set to the list of output files that are out of date. This might be useful if you can pass a bunch of files to saxon rather than a folder.
I think one issue here is that 'saxon' supports either one file or all files in a directory, so isn't suitable for batch processing without copying to temporary directories.
Otherwise, this is quite simple to do by using a timestamp marker file as a proxy target. For example:
output/.timestamp : $(FILES)
	mkdir -p $(@D)
	$(COMMAND) -outputdir=output $?
	touch $@
The three commands are:
Ensure that the output directory exists.
Run the batch command on files newer than the timestamp file.
Update the timestamp file (creating it if necessary).
Remember that each line of a recipe is executed in its own subshell, and that if any line fails, the subsequent lines are not invoked.
This approach is useful with Java builds.
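When the batch tool only accepts a directory, as saxon does here, the "newer than the timestamp" selection that $? provides can be combined with a staging directory. The shell sketch below illustrates that combination. It assumes a flat input directory and a find that supports -maxdepth, and it does not restrict the files to the $(FILES) subset. The function name stage_changed is hypothetical.

```shell
#!/bin/sh
# stage_changed INPUT_DIR STAMP: copy the *.xhtml files in INPUT_DIR that are
# newer than STAMP (or all of them when STAMP does not exist yet) into a fresh
# temporary directory, then print that directory's name.
stage_changed() (
    input=$1
    stamp=$2
    tmp=$(mktemp -d) || exit 1
    if [ -e "$stamp" ]; then
        set -- -newer "$stamp"
    else
        set --
    fi
    # "-exec ... {} +" batches the names, so cp runs once per batch, not once per file.
    find "$input" -maxdepth 1 -type f -name '*.xhtml' "$@" \
        -exec sh -c 'dest=$1; shift; cp "$@" "$dest"' sh "$tmp" {} + || exit 1
    printf '%s\n' "$tmp"
)

# Usage, mirroring the rule above (paths are placeholders):
# tmp=$(stage_changed input output/.timestamp)
# saxon -s:"$tmp" -o:output foo.xslt && touch output/.timestamp
# rm -rf "$tmp"
```

The temporary directory is only created when the function actually runs, not when a makefile is parsed, which avoids the problem the question describes with $(shell mktemp -d).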

Rsync previous half-copied files?

I found rsync behaves differently in the following two situations:
(1) If all the files were copied with rsync, running rsync again is fast (it skips all the files);
(2) If the files were copied with cp, running rsync afterwards is slow (or maybe it copies everything afresh?)
So my confusion is: does rsync generate anything internal on the files that it can refer to, so as to avoid checking them again?
rsync -a (archive mode, which I presume you ran) preserves file attributes, including the modification time. Plain cp (without -p) does not: the copies get the time of the copy as their modification time. By default rsync skips a file only when its size and modification time match on both sides, so rsync keeps no internal records on the files. I suppose the later modification times on the destination files made by cp made rsync treat them as changed, so it either recopied them or had to check the contents.
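The effect can be reproduced without rsync. The sketch below approximates rsync's default size-and-modification-time comparison with plain shell tests. The function name looks_unchanged is hypothetical and this is an illustration of the idea, not rsync's actual implementation.

```shell
#!/bin/sh
# looks_unchanged SRC DST: succeed when both files have the same size and the
# same modification time, roughly the comparison rsync makes by default.
looks_unchanged() {
    [ -f "$1" ] && [ -f "$2" ] || return 1
    [ "$(wc -c < "$1")" -eq "$(wc -c < "$2")" ] || return 1
    # Neither file is newer than the other, so the modification times are equal.
    ! [ "$1" -nt "$2" ] && ! [ "$1" -ot "$2" ]
}

# Usage (file names are placeholders):
# cp    data.bin plain.bin; looks_unchanged data.bin plain.bin  # fails: mtime differs
# cp -p data.bin kept.bin;  looks_unchanged data.bin kept.bin   # succeeds: mtime preserved
```

So copying with cp -p (or cp -a) keeps a later rsync run fast. Alternatively, rsync's --size-only option ignores modification times, and --checksum compares contents at the cost of reading every file.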
