How to know which files `rsync --update` won't sync?

Is there any way to know what files haven't been synced by rsync --update?
I'd like to show these filenames in the terminal or, even better, redirect them to a text file.

I'll answer myself: use the -vv option together with grep, like this:
rsync -vv --update [maybe other options] dirA dirB | grep newer > newer_files.txt
The -vv option will tell you which files aren't being updated, reporting them on the terminal as foofile is newer.
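If you just want the list without transferring anything at all, it should also be possible to add --dry-run (-n) so rsync only reports what it would do, something like:
rsync -n -vv --update [maybe other options] dirA dirB | grep 'is newer' > newer_files.txt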

Related

wget, recursively download all jpegs works only on website homepage

I'm using wget to download all jpegs from a website.
I searched a lot and this should be the way:
wget -r -nd -A jpg "http://www.hotelninfea.com"
This should recursively (-r) download the jpeg files (-A jpg) and store them all in a single directory, without recreating the website's directory tree (-nd).
Running this command downloads only the jpegs from the homepage of the website, not the jpegs from the whole website.
I know that a jpeg file could have different extensions (jpg, jpeg) and so on, but that is not the case here, and there are no robots.txt restrictions in play either.
If I remove the filter from the previous command, it works as expected
wget -r -nd "http://www.hotelninfea.com"
This is happening on Lubuntu 16.04 64bit, wget 1.17.1
Is this a bug, or am I misunderstanding something?
I suspect that this is happening because the main page you mention contains links to the other pages in the form http://.../something.php, i.e., with an explicit extension. The option -A jpg then has the "side effect" of removing those pages from the traversal process.
Perhaps a somewhat dirty workaround in this particular case would be something like this:
wget -r -nd -A jpg,jpeg,php "http://www.hotelninfea.com" && rm -f *.php
i.e., to download only the necessary extra pages and then delete them if wget successfully terminates.
ewcz's answer pointed me the right way: the --accept acclist parameter has a dual role; it defines both the rules for saving files and the rules for following links.
Reading the manual more deeply, I found this:
If ‘--adjust-extension’ was specified, the local filename might have ‘.html’ appended to it. If Wget is invoked with ‘-E -A.php’, a filename such as ‘index.php’ will be accepted, but upon download will be named ‘index.php.html’, which no longer matches, and so the file will be deleted.
So you can do this
wget -r -nd -E -A jpg,php,asp "http://www.hotelninfea.com"
But of course a webmaster could be using custom extensions.
So I think that the most robust solution would be a bash script, something like:
WEBSITE="http://www.hotelninfea.com"
DEST_DIR="."

image_urls=$(wget -nd --spider -r "$WEBSITE" 2>&1 | grep '^--' | awk '{ print $3 }' | grep -i '\.\(jpeg\|jpg\)')
for image_url in $image_urls; do
    DESTFILE="$DEST_DIR/$RANDOM.jpg"
    wget "$image_url" -O "$DESTFILE"
done
With --spider, wget will not download the pages, it just checks that they are there.
$RANDOM asks the shell for a random number, used here so each saved file gets a unique name.

How to list subdirectories from an URL on a HTTP file share?

I would like to know if there is an easy way to list all files/directories from an HTTP file share. By default the HTTP server displays them, but I'm wondering if there is an easy way to get the list of files without manually parsing the returned webpage.
Any solution that would use curl, wget or python should be just fine.
No, there's no generic way to do this.
wget is only designed to download files, not to list directories.
If that's all you've got, though...
wget -r http://SOME.SITE/PATH 2>&1 | grep 'Saving to:' | sed "s/Saving to: \`\([^?']*\).*'/\1/" | uniq -u
rm -rf SOME.SITE
(Just so you don't sue me later, this is downloading all of the files from the site and then deleting them when it's done)
Edit: Sorry, I'm tired. If you want only the top-level directories, you can do something like this:
wget -rq http://SOME.SITE/PATH
ls -1p SOME.SITE | grep '/$'
rm -rf SOME.SITE
This does the same as above, but only lists immediate subdirectories of the URL.
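If the server exposes a standard auto-generated index page, a rough sketch that avoids downloading anything is to fetch just that page and pull out the links ending in a slash. Note that this assumes an Apache/nginx-style autoindex layout, which varies between servers:
wget -qO- "http://SOME.SITE/PATH/" | grep -o 'href="[^"]*/"' | sed 's/^href="//; s/"$//' | grep -v '^\.\./$'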

read input from a file and sync accordingly

I have a text file which contains the list of files and directories that I want to copy (one per line). Now I want rsync to take this input from my text file and sync it to the destination that I provide.
I've tried playing around with the "--include-from=FILE" and "--files-from=FILE" options of rsync, but it is just not working.
I also tried prefixing "+" to each line in my file, but it is still not working.
I have tried coming up with various filter PATTERNs as outlined in the rsync man page, but it is still not working.
Could someone provide me with the correct syntax for this use case? I've tried the above on Fedora 15, RHEL 6.2 and Ubuntu 10.04 and none worked, so I am definitely missing something.
Many thanks.
There is more than one way to answer this question depending on how you want to copy these files. If your intent is to copy the file list with absolute paths, then it might look something like:
rsync -av --files-from=/path/to/files.txt / /destination/path/
...This would expect the paths to be relative to the source location of / and would retain the entire absolute structure under that destination.
If your goal is to copy all of those files in the list to the destination, without preserving any kind of path hierarchy (just a collection of files), then you could try one of the following:
# note: this method might break if your file list is too long and
# exceeds the maximum argument limit
rsync -av $(cat /path/to/file) /destination/
# or get fancy with xargs to batch 200 of your items at a time
# with multiple calls to rsync (-J is BSD xargs; with GNU xargs you
# could use: xargs -n 200 sh -c 'rsync -av "$@" /destination/' _ )
cat /path/to/file | xargs -n 200 -J % rsync -av % /destination/
Or a for-loop and copy:
# bash shell
for f in $(cat /path/to/files.txt); do cp "$f" /dest/; done
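Or, as a small sketch that also copes with paths containing spaces (the for-loop above splits on whitespace):
# read the list line by line and copy each entry
while IFS= read -r f; do
    cp "$f" /dest/
done < /path/to/files.txt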
Given a file listing $HOME/GET/bringemback containing
need/A
alsoneed/B
shouldget/C
cd $HOME/GET
and run
rsync -av --files-from=./bringemback me@theremote:. $HOME/GET/collect
would get the files and drop them into $HOME/GET/collect
$HOME/GET/
    collect/
        need/A
        alsoneed/B
        shouldget/C
or so I believe.
rsync supports this natively (note that when --files-from is used, -a no longer implies --recursive, which is why it is given explicitly):
rsync --recursive -av --files-from=/path/to/files.txt / /destination/path/

Rsync: provide a list of unsent files

The Rsync -u flag prevents the overwriting of modified destination files. How can I get a list of files that were not sent due to this flag? The -v flag will let me know which files were sent, but I would like to know which ones weren't.
From the rsync man page:
-i, --itemize-changes
Requests a simple itemized list of the changes that are being
made to each file, including attribute changes. This is exactly
the same as specifying --out-format='%i %n%L'. If you repeat
the option, unchanged files will also be output, but only if the
receiving rsync is at least version 2.6.7 (you can use -vv with
older versions of rsync, but that also turns on the output of
other verbose messages).
In my testing, the -ii option isn't working with rsync 3.0.8, but -vv is. Your mileage may vary.
You could also get substantially the same information by invoking rsync with --dry-run and --existing in the opposite direction. So if your regular transfer looked like this:
rsync --update --recursive local:/directory/ remote:/directory/
You would use:
rsync --dry-run --existing --recursive remote:/directory/ local:/directory/
but -vv or -ii is safer and less prone to misinterpretation.
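Tying this back to the first question above, a quick sketch that just lists the skipped files without transferring anything: with -vv each skipped file is reported as "somefile is newer", so you can filter on that (the output filename is of course arbitrary):
rsync --update --recursive -vv --dry-run local:/directory/ remote:/directory/ | grep 'is newer' > not_sent.txt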

How do you get rsync to exclude any directory named cache?

I'm new to rsync and have read a bit about excluding files and directories but I don't fully understand and can't seem to get it working.
I'm simply trying to run a backup of all the websites in a server's webroot but don't want any of the CMS's cache files.
Is there a way to exclude any directory named cache?
I've tried a lot of things over the weeks (that I don't remember), but more recently I've been trying these sorts of things:
sudo rsync -avzO -e --exclude *cache ssh username@11.22.33.44:/home/ /Users/username/webserver-backups/DEV/home/
and this:
sudo rsync -avzO -e --exclude cache/ ssh username@11.22.33.44:/home/ /Users/username/webserver-backups/DEV/home/
and this:
sudo rsync -avzO -e --exclude */cache/ ssh username@11.22.33.44:/home/ /Users/username/webserver-backups/DEV/home/
and this:
sudo rsync -avzO -e --exclude *cache/ ssh username@11.22.33.44:/home/ /Users/username/webserver-backups/DEV/home/
Sorry if this is easy; I just haven't been able to find info that I understand, because everything I find talks about a path to exclude.
It's just that I don't have a specific path I want to exclude, just a directory name, if that makes sense.
rsync --exclude cache/ ....
should work like peaches. I think you might be confusing some things, since -e requires an argument (like -e "ssh -l ssh-user"). Edit: on looking at your command lines a little closer, it turns out this is exactly your problem. You should have said:
--exclude cache/ -e ssh
although you could just drop -e ssh since ssh is the default.
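Applied to the command from your question, with -e ssh dropped, that would be something like this (a sketch using the same host and paths as above):
sudo rsync -avzO --exclude 'cache/' username@11.22.33.44:/home/ /Users/username/webserver-backups/DEV/home/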
I'd also recommend that you look at the filter rules:
rsync -FF ....
That way you can place .rsync-filter files throughout your directory tree, containing things like
- cache/
This makes things much more flexible, makes command lines more readable, and lets you make exceptions inside specific subtrees.
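For example, a sketch of that setup, assuming you can place .rsync-filter files in the source tree on the remote host (with -FF the filter files themselves are excluded from the transfer):
# e.g. /home/somesite/.rsync-filter on the sending side, containing the single line:
#   - cache/
# then back up with:
sudo rsync -avzO -FF -e ssh username@11.22.33.44:/home/ /Users/username/webserver-backups/DEV/home/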

Resources