I'm building software that needs to do massive amounts of file transfer via both HTTP and FTP. Often I get faster HTTP downloads with a multi-connection download accelerator like axel or lftp with pget. In some cases I've seen 2x-3x faster transfers using something like:
axel http://example.com/somefile
or
lftp -e 'pget -n 5 http://example.com/somefile;quit'
vs. just using wget:
wget http://example.com/somefile
But other times, wget is significantly faster than lftp. Strangely, this is true even when I use lftp's pget with a single connection, like so:
lftp -e 'pget -n 1 http://example.com/somefile;quit'
I understand that downloading a file via multiple connections won't always result in a speedup, depending on how bandwidth is constrained. But: why would it be slower? Especially when calling lftp/pget with -n 1?
Is it possible that the HTTP server is compressing the stream with gzip? I can't remember whether wget handles gzip Content-Encoding or not; if it does, that might explain the performance boost. Another possibility is that there is an HTTP cache somewhere in the pipeline. You can try something like
wget --no-cache --header="Accept-Encoding: identity"
and compare this to your FTP-based transfer times.
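For what it's worth, here is a quick way to test both hypotheses (a sketch only; example.com stands in for your real server, and it assumes curl is available for the header check):
# Does the server offer gzip Content-Encoding when asked for it?
curl -sI -H 'Accept-Encoding: gzip' http://example.com/somefile | grep -i '^content-encoding'
# Time a single-connection download with compression and caching disabled,
# then compare against your axel / lftp pget numbers.
time wget --no-cache --header="Accept-Encoding: identity" -O /dev/null http://example.com/somefile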
From the rsync manual I see that, by using the --rsync-path option, it is possible to specify what program is to be run on the remote machine to start up rsync. In particular, the program could be a wrapper script which calls the actual rsync command in the middle, but which performs some actions before and/or after the rsync invocation. One interesting use would be to acquire/release a lock (e.g. a flock), so that the operations of rsync at the remote end could be coordinated with another process at the far end which is contending for write access to the same files. There could be multiple rsync processes simultaneously holding the shared lock (I am aware of the potential for starvation but am not concerned about that right now). The 'writer' process I'm dealing with would just be changing a few hard links, so it would not block the rsync processes for any significant length of time.
I have looked at other co-ordination approaches, e.g., implementing a custom remote locking protocol between the client and server, but they all involve more development work and/or are unsatisfactory for other reasons, which is why I am interested in the wrapper/(f)lock approach.
My questions are:
1) Is this a reasonable way to solve the problem of co-ordinating rsync 'readers' with another, 'writer' process accessing the same directory?
2) Can you also put a wrapper around rsync when using the inetd (or xinetd) daemon approach to running rsync, by adding a line something like the following to /etc/inetd.conf (as per the rsyncd.conf man page):
rsync stream tcp nowait root /usr/bin/rsync rsyncd --daemon
but replacing /usr/bin/rsync with the path to your rsync-lookalike wrapper, which in this case would be a C/C++ program that seizes a lock, forks off rsync, waits for rsync to complete, and then releases the lock?
Thanks,
Tom
One potential catch with the wrapper approach: the remote process is called with extra arguments, which are appended to whatever command line you specify with --rsync-path. So if you need to pass arguments of your own, something in the following style is needed:
#!/bin/sh
# First argument: the lock target; remaining arguments: the rsync server
# command line that the client appends after --rsync-path.
lock_target=$1
shift
# lockfile(1) comes from procmail; give up if the lock cannot be acquired
if ! lockfile ${lock_target}.lock ; then exit 1 ; fi
trap "rm -f ${lock_target}.lock" EXIT HUP TERM INT
/usr/bin/rsync "$@"
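For completeness, a client-side invocation might then look something like this (hypothetical paths; the wrapper above is assumed to be installed on the remote host as /usr/local/bin/rsync-locked):
# The wrapper receives /remote/data as $1 (the lock target); rsync then
# appends its own "--server --sender ..." arguments after it.
rsync -a --rsync-path="/usr/local/bin/rsync-locked /remote/data" \
      remote:/remote/data/ /local/copy/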
Thanks to the question and the comments. Armed with your ideas I solved it (for me) using --rsync-path, but without any wrapper scripts on the remote host, simply by putting the whole payload script into --rsync-path, with a few tricks.
This particular example uses rsync to pull data from the remote host while holding a flock on the remote host; e.g. the remote host dumps data periodically while also holding the flock, so the dump and the pull must not be interleaved.
Points to note:
rsync will append its arguments to the end of whatever command you specify in --rsync-path, so the command needs to cope with that; for that I rely on bash features on both the pulling and the remote host.
Any pre- or post-processing on the remote host must not write to STDOUT, because that would corrupt the rsync protocol and rsync would bail out. Any error output should go to STDERR; it turns up on the pulling host as rsync STDERR output. That is why '1>&2' appears in all the error handling.
This probably relies on the remote command spawned by rsync being run by bash, because I think good old sh does not support bash features like the declare output used here. It works for me between RHEL7 boxes. A possible workaround is proposed at the end.
With that in mind, here is my simplified, concept-only rehash (I have not run this particular script; my full solution has extra layers that would distract from the main point).
The script on the pulling host:
#!/bin/bash
function rsync_wrap() {
    {
        flock --exclusive --timeout ${LOCK_TIMEOUT} 100 || {
            echo "Failed to acquire lock within ${LOCK_TIMEOUT} seconds" 1>&2
            return 1
        }
        # call real rsync with the original arguments
        rsync "$@"
        exit_code=$?
        if [ ${exit_code} -eq 0 ]; then
            : # Do clean-up for the success case here, e.g.
            # rm -f "${LOCK_FILE}"
            # rm -rf /eg/purge/data
        else
            : # Do clean-up for the failure case here
        fi
        # Note: the return is important, do not let control fall out of the block
        return ${exit_code}
    } 100<"${LOCK_FILE}"
    # Only reached if the redirection above failed, i.e. the lock file
    # could not be opened
    echo "Failed to open lock file: ${LOCK_FILE}" 1>&2
    return 1
}
# Define vars
LOCK_FILE=/var/somedir/name.lock; # or /dev/shm/name.lock
LOCK_TIMEOUT=600; #in seconds
# Build remote command, define vars and functions inside the command
remote_cmd="
# this approach deals with crazy chars in variables and function code
$( declare -p LOCK_FILE )
$( declare -p LOCK_TIMEOUT )
$( declare -f rsync_wrap )
rsync_wrap "
local_cmd=(
rsync
-a
--rsync-path="${remote_cmd}"
# I want to handle network timeouts in SSH, not in rsync,
# because rsync does not know that waiting for the lock is expected
-e "ssh -o BatchMode=yes -o ServerAliveCountMax=3 -o ServerAliveInterval=30 ${IDENTITY_FILE:+ -i '${IDENTITY_FILE}'}"
/remote/source/path
/local/destination/path/
)
# Do it
"${local_cmd[#]}"
If the remote side executes the --rsync-path command with something other than bash, then maybe the whole remote command could be wrapped in something like:
remote_cmd="bash -c '${remote_cmd//\'/\'\\\'\'}'"
As per the comments on the original post, it is indeed feasible to use the wrapper approach to implement (f)locks around rsync at the server end.
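For question 2 (the inetd/xinetd case), the wrapper does not have to be written in C/C++ either; a minimal shell sketch of the same seize-lock/run/release idea, using a made-up lock path, could be installed in /etc/inetd.conf in place of /usr/bin/rsync:
#!/bin/sh
# Hypothetical stand-in for /usr/bin/rsync in the inetd.conf line.
# flock(1) takes a shared lock on the lock file and then runs the real
# rsync daemon; the lock is released automatically when rsync exits.
exec flock --shared /var/lock/rsyncd.lock /usr/bin/rsync "$@"
A writer that takes the same lock exclusively (flock --exclusive) will then wait until no rsync session is running.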
The rsync -u (--update) flag prevents overwriting destination files that are newer than the source. How can I get a list of the files that were not sent due to this flag? The -v flag will tell me which files were sent, but I would like to know which ones weren't.
From the rsync man page:
-i, --itemize-changes
Requests a simple itemized list of the changes that are being
made to each file, including attribute changes. This is exactly
the same as specifying --out-format='%i %n%L'. If you repeat
the option, unchanged files will also be output, but only if the
receiving rsync is at least version 2.6.7 (you can use -vv with
older versions of rsync, but that also turns on the output of
other verbose messages).
In my testing, the -ii option isn't working with rsync 3.0.8, but -vv is. Your mileage may vary.
You could also get substantially the same information by invoking rsync with --dry-run and --existing in the opposite direction. So if your regular transfer looked like this:
rsync --update --recursive local:/directory/ remote:/directory/
You would use:
rsync --dry-run --existing --recursive remote:/directory/ local:/directory/
but -vv or -ii is safer and less prone to misinterpretation.
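Note that a dry run prints nothing per file unless some output option is given, so to capture the skipped names in a list you could use something along these lines (a sketch building on the reverse command above; --out-format='%n' prints one path per line, and adding --update restricts the list to files whose remote copy is newer):
# Files listed here exist on both sides and are newer on remote:/directory/,
# i.e. exactly the files the forward transfer skipped because of -u.
rsync --dry-run --existing --update --recursive --out-format='%n' \
      remote:/directory/ local:/directory/ > skipped.txt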
How do I download only existing files with curl on the command line? I have a command like this:
curl http://host.com/photos/IMG_4[200-950].jpg -u user:pass -o IMG_4#1.jpg
This command downloads all images from IMG_4200.jpg to IMG_4950.jpg, even ones that do not exist.
Use -f:
(HTTP) Fail silently (no output at all) on server errors. This is mostly done to better enable scripts etc to better deal with failed attempts. In normal cases when a HTTP server fails to deliver a document, it returns an HTML document stating so (which often also describes why and more). This flag will prevent curl from outputting that and return error 22.
This method is not fail-safe and there are occasions where non-successful response codes will slip through, especially when authentication is involved (response codes 401 and 407).
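Applied to the command from the question:
# -f makes curl fail silently (error 22) for images in the range that do not
# exist, instead of writing the server's HTML error page into IMG_4NNN.jpg.
curl -f http://host.com/photos/IMG_4[200-950].jpg -u user:pass -o IMG_4#1.jpg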
Greetings to everyone,
I'm on OS X. I use the terminal a lot, a habit from my old Linux days that I never gave up. I wanted to download the files listed on this HTTP server: http://files.ubuntu-gr.org/ubuntistas/pdfs/
I selected them all with the mouse, put them in a text file, and then gave the following command in the terminal:
for i in `cat ../newfile`; do wget http://files.ubuntu-gr.org/ubuntistas/pdfs/$i;done
I guess it's pretty self-explanatory.
I was wondering if there's an easier, better, cooler way to download these linked PDF files using wget or curl.
Regards
You can do this with one line of wget as follows:
wget -r -nd -A pdf -I /ubuntistas/pdfs/ http://files.ubuntu-gr.org/ubuntistas/pdfs/
Here's what each parameter means:
-r makes wget recursively follow links
-nd avoids creating directories so all files are stored in the current directory
-A restricts the files saved by type
-I restricts by directory (this one is important if you don't want to download the whole internet ;)
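Since the question mentions curl as well: curl has no equivalent of wget -r, but a rough sketch like the following (which assumes the directory index is a plain HTML listing with relative links to the PDFs; the grep/sed patterns may need adjusting) fetches the same set of files:
curl -s http://files.ubuntu-gr.org/ubuntistas/pdfs/ \
  | grep -o 'href="[^"]*\.pdf"' \
  | sed 's/^href="//;s/"$//' \
  | while read -r f; do
      curl -O "http://files.ubuntu-gr.org/ubuntistas/pdfs/$f"
    done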
I have some applications and standard Unix tools sending their output to named pipes on Solaris. However, named pipes can only be read locally (on Solaris), so I can't access them over the network or place the pipes on NFS storage for networked access to their output.
That got me wondering whether there is an analogous way to forward the output of command-line tools directly to sockets, say something like:
mksocket mysocket:12345
vmstat 1 > mysocket 2>&1
Netcat is great for this.
Usage for your case might look something like this:
Server listens for a connection, then sends output to it:
server$ my_script | nc -l 7777
Remote client connects to server on port 7777, receives data, saves to a log file:
client$ nc server 7777 >> /var/log/archive
netcat (also known as nc) is exactly what you're looking for. It's getting to be reasonably standard, but not available on all systems.
socat seems to be a beefed-up version of netcat, with lots more features, but less commonly available.
On Linux, bash can also write directly to /dev/tcp/<host>/<port> (a bash redirection feature, not an actual device). See the Advanced Bash-Scripting Guide for more information.
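For example, paired with the netcat listener shown above, something like this should stream vmstat straight from a bash shell (loghost and 7777 are placeholders, and it assumes the local bash was built with network redirections enabled):
# bash itself opens the TCP connection; no extra tool is needed on the sender
vmstat 1 > /dev/tcp/loghost/7777 2>&1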
netcat will help establish a pipe over the network.
You may want to use one of:
ssh: secure (encrypted), already installed out-of-the-box on Solaris - but you have to set up a keypair for non-interactive sessions
e.g. vmstat 1 2>&1 | ssh -i private.key oss@remote.node "cat > vmstat.out"
netcat: simple to set up - but insecure and open to attacks
see http://www.debian-administration.org/articles/58 etc.
Everyone is on the right track with netcat. But I want to add that if you are piping into nc and expecting a response, you will need to use the -q <seconds> option. From the manual:
-q seconds
after EOF on stdin, wait the specified number of seconds and then quit. If seconds is negative, wait forever.
For instance, if you want to interact with your SSH Agent you can do something like this:
echo -en '\x00\x00\x00\x01\x0b' | nc -q 1 -U $SSH_AUTH_SOCK | strings
A more complete example is at https://gist.github.com/RichardBronosky/514dbbcd20a9ed77661fc3db9d1f93e4
* I stole this from https://ptspts.blogspot.com/2010/06/how-to-use-ssh-agent-programmatically.html