How does `scp` differ from `rsync`? - rsync

An article about setting up Ghost blogging says to use scp to copy from my local machine to a remote server:
scp -r ghost-0.3 root@*your-server-ip*:~/
However, Railscast 339: Chef Solo Basics uses scp to copy in the opposite direction (from the remote server to the local machine):
scp -r root@178.xxx.xxx.xxx:/var/chef .
In the same Railscast, when the author wants to copy files to the remote server (same direction as the first example), he uses rsync:
rsync -r . root@178.xxx.xxx.xxx:/var/chef
Why use the rsync command if scp will copy in both directions? How does scp differ from rsync?

The major difference between these tools is how they copy files.
scp basically reads the source file and writes it to the destination. It performs a plain linear copy, locally, or over a network.
rsync also copies files locally or over a network. But it employs a special delta transfer algorithm and a few optimizations to make the operation a lot faster. Consider the call.
rsync A host:B
rsync will check the file sizes and modification timestamps of both A and B, and skip any further processing if they match.
If the destination file B already exists, the delta transfer algorithm will make sure only differences between A and B are sent over the wire.
rsync will write data to a temporary file T, and then replace the destination file B with T to make the update look "atomic" to processes that might be using B.
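The size-and-timestamp check described above can be sketched in a few lines of shell. This is only an illustration of the idea, not rsync's actual code; the function names `file_signature` and `needs_transfer` are made up for this example, and it assumes GNU `stat`.

```shell
#!/bin/sh
# Sketch of a size + mtime "quick check"; NOT rsync's real implementation.
# Assumes GNU stat (-c); on BSD/macOS the equivalent is `stat -f '%z %m'`.

# Print "size mtime" for a file.
file_signature() {
    stat -c '%s %Y' "$1"
}

# Succeed (exit 0) when dst is missing or its signature differs from src.
needs_transfer() {
    src=$1
    dst=$2
    if [ ! -e "$dst" ]; then
        return 0
    fi
    [ "$(file_signature "$src")" != "$(file_signature "$dst")" ]
}

# Example use: copy only when the quick check says something changed.
#   if needs_transfer A B; then cp -p A B; fi
```

Note that a copy made without preserving timestamps (plain `cp` instead of `cp -p`) may make the files look different to this check even when their contents are identical.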
Another difference between them concerns invocation. rsync has a plethora of command line options, allowing the user to fine tune its behavior. It supports complex filter rules, runs in batch mode, daemon mode, etc. scp has only a few switches.
In summary, use scp for your day-to-day tasks: commands that you type once in a while in your interactive shell. It's simpler to use, and in those cases rsync's optimizations won't help much.
For recurring tasks, like cron jobs, use rsync. As mentioned, on multiple invocations it will take advantage of data already transferred, performing very quickly and saving on resources. It is an excellent tool to keep two directories synchronized over a network.
Also, when dealing with large files, use rsync with the -P option. If the transfer is interrupted, you can resume it where it stopped by reissuing the command. See Sid Kshatriya's answer.
Finally, note that the rsync:// protocol is similar to plain HTTP: unencrypted and with no integrity checks. Be sure to always use rsync via SSH (as in the examples from the question above), not via the rsync protocol, unless you really know what you're doing. scp always uses SSH as its underlying transport mechanism, which provides both integrity and confidentiality guarantees, so that is another difference between the two utilities.

rsync can be useful on slow and unreliable connections. If your download aborts in the middle of a large file, rsync will be able to continue from where it left off when invoked again.
Use rsync -vP username@host:/path/to/file .
The -P option preserves partially downloaded files and also shows progress.
As usual, check man rsync.

Differences between scp and rsync on several parameters
1. Performance over latency
scp : scp is relatively less optimized and slower
rsync : rsync is comparatively more optimized and faster
https://www.disk91.com/2014/technology/networks/compare-performance-of-different-file-transfer-protocol-over-latency/
2. Interruption handling
scp : the scp command line tool cannot resume aborted downloads after a lost network connection
rsync : if an rsync session started with -P is interrupted, you can resume it as many times as you want by typing the same command. rsync will automatically restart the transfer where it left off.
http://ask.xmodulo.com/resume-large-scp-file-transfer-linux.html
3. Command Example
scp
$ scp source_file_path destination_file_path
rsync
$ cd /path/to/directory/of/partially_downloaded_file
$ rsync -P --rsh=ssh userid@remotehost.com:bigdata.tgz ./bigdata.tgz
The -P option is the same as --partial --progress, allowing rsync to work with partially downloaded files. The --rsh=ssh option tells rsync to use ssh as a remote shell.
4. Security :
scp is more secure. You have to use rsync --rsh=ssh to make it as secure as scp.
See the man pages to learn more:
scp : http://www.manpagez.com/man/1/scp/
rsync : http://www.manpagez.com/man/1/rsync/

One major feature of rsync over scp (besides the delta algorithm and encryption if used with ssh) is that it automatically verifies that the file has been transferred correctly. scp will not do that, which occasionally might result in corruption when transferring larger files. So in general rsync is a copy with a guarantee.
The CentOS man pages mention this at the end of the --checksum option description:
Note that rsync always verifies that each transferred file was
correctly reconstructed on the receiving side by checking a whole-file
checksum that is generated as the file is transferred, but that
automatic after-the-transfer verification has nothing to do with this
option’s before-the-transfer “Does this file need to be updated?”
check.
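If you want a similar after-the-transfer check around a plain copy, one option is to compare whole-file checksums yourself. The sketch below does this locally with POSIX `cksum`; the names `checksum_of` and `verified_copy` are invented for this example, and this is not the checksum rsync uses internally.

```shell
#!/bin/sh
# Copy a file, then compare whole-file checksums of source and destination.
# Reading from stdin makes cksum print "<crc> <size>" without a file name.

checksum_of() {
    cksum < "$1" | awk '{ print $1, $2 }'
}

verified_copy() {
    src=$1
    dst=$2
    cp "$src" "$dst" || return 1
    if [ "$(checksum_of "$src")" != "$(checksum_of "$dst")" ]; then
        echo "checksum mismatch: $dst" >&2
        return 1
    fi
}
```

For a remote copy, the second checksum would have to be computed on the other machine (for example by running `cksum` there through ssh) and compared with the local value.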

There's a distinction to me that scp is always encrypted with ssh (secure shell), while rsync isn't necessarily encrypted. More specifically, rsync doesn't perform any encryption by itself; it's still capable of using other mechanisms (ssh for example) to perform encryption.
In addition to security, encryption also has a major impact on your transfer speed, as well as the CPU overhead. (My experience is that rsync can be significantly faster than scp.)
Check out this post for when rsync has encryption on.

scp is best for one file, or, combined with tar and compression, for smaller data sets
like source code trees with small resources (e.g. images, sqlite, etc.).
Yet, when you begin dealing with larger volumes say:
media folders (40 GB)
database backups (28 GB)
mp3 libraries (100 GB)
It becomes impractical to build a zip/tar.gz file to transfer with scp at this point, due to the physical limits of the hosted server.
As an exercise, you can do some gymnastics like piping tar into ssh and redirecting the results into a remote file (saving the need to build a swap or temporary clone, i.e. a zip or tar.gz).
However, rsync simplifies this process and allows you to transfer data without consuming any additional disk space.
Also, continuous (cron?) updates that send only minimal changes, instead of full cloned copies, speed up large data migrations over time.
tl;dr
scp == small scale (with room to build compressed files on the same drive)
rsync == large scale (with the necessity to backup large data and no room left)

It's better to think in a practical context. In our team, we use rsync -aP to replace a bad Cassandra host in our cluster. We can't do this with scp (it is slow and does not preserve progress).

Related

Transfer function that saves progress?

Does a function exist similar to scp where if the connection is lost, then the progress is saved, and resuming the process picks up where it left off? I am trying to scp a large file, and my VPN connection keeps cutting out.
Use rsync --partial. It will keep partially transferred files, which you can then resume with the same invocation. From the rsync man page:
--partial
By default, rsync will delete any partially transferred file if the transfer is interrupted. In some circumstances it is
more desirable to keep partially transferred files. Using the --partial option tells rsync to keep the partial file which
should make a subsequent transfer of the rest of the file much faster.
Try something like rsync -aivz --partial user@host:/path/to/file ~/destination/folder/
Explanation of the other switches:
a — "archive mode": make transfer recursive; preserve symlinks, permissions, timestamps, group, owner; and (where possible) preserve special and device files
i — "itemize changes": shows you what exactly is getting changed (it will be a string of all + signs if you're copying a file anew +++++++)
v — "verbose": list files as they're transferred
z — "zip": compress data during transfer
Those are just the ones I usually use to transfer files. You can see a list of all options by looking at the rsync man page.
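To see why keeping the partial file helps, here is a toy version of "resume" in plain shell: keep the bytes that already arrived and append only the rest. It is not what rsync does internally (rsync checks the existing data with checksums); `resume_copy` is a made-up name, and the sketch simply assumes the partial file is an exact prefix of the source.

```shell
#!/bin/sh
# Toy illustration of resuming a copy: keep what arrived, append the rest.
# ASSUMPTION: the partial destination is an exact prefix of the source.
# rsync verifies this kind of thing with checksums; this sketch does not.

resume_copy() {
    src=$1
    dst=$2
    have=0
    if [ -f "$dst" ]; then
        have=$(wc -c < "$dst")
    fi
    # tail -c +N starts output at byte N (1-based), skipping the first $have bytes.
    tail -c +"$((have + 1))" "$src" >> "$dst"
}
```

Running it again on a complete file appends nothing, which mirrors the "just reissue the same command" workflow.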

Rsync hangs when transferring large files on a bad connection

I've been having problems transferring files over a pretty bad connection (I'm trying to upload a file on a cloud server) via rsync.
The rsync essentially hangs after about a minute or so. This is the way I'm trying to perform the upload:
rsync -avz --progress -e "ssh" ~/target.war root@my-remote-server:~/
There is no error message or other info. It simply hangs displaying something like:
7307264 14% 92.47kB/s 0:07:59
Pinging the remote endpoint doesn't seem to be losing packets, as far as I can see.
Any help on how to overcome this problem would be nice. Thank you.
The --partial option keeps partially transferred files if the transfer is interrupted. You could use it to try again without having to transfer the whole file again.
In fact, there is the -P option, which is equivalent to --partial --progress. According to rsync's man page, "Its purpose is to make it much easier to specify these two options for a long transfer that may be interrupted."
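Since a resumable transfer is just "run the same command again", it can be wrapped in a small retry loop. The helper below is a generic sketch; `retry` and the `RETRY_DELAY` variable are names invented here, not part of rsync.

```shell
#!/bin/sh
# retry MAX CMD [ARGS...]: run CMD until it succeeds, at most MAX times.
# RETRY_DELAY (seconds between attempts, default 5) is a hypothetical knob.

retry() {
    max=$1
    shift
    attempt=1
    until "$@"; do
        if [ "$attempt" -ge "$max" ]; then
            echo "giving up after $attempt attempts" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "${RETRY_DELAY:-5}"
    done
}

# Example (not run here): keep re-issuing the upload until it completes.
#   retry 10 rsync -avz -P -e ssh ~/target.war root@my-remote-server:~/
```

A loop like this only helps when rsync actually exits. For a transfer that hangs silently, as in the question, rsync's `--timeout` option may be worth adding so that a stall turns into a failure that can be retried.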

How to deploy MPI program?

MPI requires that I deploy the MPI program to each machine. Currently, I put the MPI program on NFS, but this method has two issues: NFS has latency problems, and NFS is not suitable for a large cluster. I know that I could use some Linux shell commands to sync my program to each node, but that does not look very convenient, especially when I change the program frequently. Is there an easy method to do that?
There's nothing wrong with NFS or any other network filing system in large clusters. It just means your file server isn't sized for the job. If you replace NFS with anything like ssh, ftp, scripts, or whatever and change nothing else, I don't think that'll make any significant difference. Also, if the loading time is a significant and bothersome component of the overall runtime then why use MPI in the first place?
OK, enough of playing devil's advocate. One thing you can do is to have nodes load your program onto other nodes in a binary tree type arrangement. You'll need a script that copies the executable to two other nodes along with a copy of the script, starts that script running asynchronously on those nodes and then runs the executable locally. The result would be a chain reaction of copying and running spreading across the network. The only difficult bit is in choosing which nodes to copy to so that each one is visited just once. It will be a lot faster.
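One simple way to pick the targets so that each node is visited exactly once is to number the nodes 0..N-1 and treat them as an implicit binary tree, where node i copies to nodes 2i+1 and 2i+2. This numbering scheme is my own assumption for illustration (the answer does not prescribe one), and `children_of` is a made-up helper name.

```shell
#!/bin/sh
# Print the indexes of the (up to two) nodes that node I should copy to,
# for N nodes numbered 0..N-1 arranged as an implicit binary tree.
# Every node except 0 has exactly one parent, so no node is visited twice.
children_of() {
    i=$1
    n=$2
    for child in $((2 * i + 1)) $((2 * i + 2)); do
        if [ "$child" -lt "$n" ]; then
            echo "$child"
        fi
    done
}
```

In a real script, each printed index would be mapped to a host name, and the node would copy the executable and this script there before starting it.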
Depending on the nature of the application and the nature of the NFS network, using a shared file system for both the MPI implementation and the application "should" be able to scale with reasonable performance, to a point. Keep in mind that there is some NFS caching at the node level, so multiple ranks on the same node will not each have to traverse the network to reach the files.
In general terms, I tend to advise that NFS be discontinued at about 128 nodes or 1024 ranks in favor of local installations. That advice changes if the NFS is delivered with 10GigE, IPoIB, or if a high performance file system like SFS or GPFS is used.
If you are committed to local installations, then tools like rsync, or scp are good candidates to distribute the bits. Script the final result. You can even do a tar to shared, and remote command (e.g. ssh, clush) un-tar to local disc. The "solution" only needs to be robust, not polished or elegant.
I'll also chime in to say the NFS should be just fine in this use-case, unless you have a cluster of over 100-200 nodes.
If you just want a lightweight tool for doing many-node parallel operations, I'd suggest pdsh. pdsh is a very common tool on HPC clusters. It includes a command called pdcp for doing parallel node copies, i.e.
pdcp -w node[00-99] myfile /path/to/destination/myfile
Where the nodenames are node00, node01, ... node99.
Similarly, you use the pdsh command to run a command in parallel across all the nodes. I.e.,
pdsh -w node[00-99] /path/to/my/executable
Alternatively, if you're looking for something a little less ad-hoc for doing these operations, I can recommend Ansible as an easy and lightweight configuration management and deployment tool. It's not as simple to get started as pdsh, but might be more manageable in the long run...
For example, a simple Ansible playbook to copy a tarball to all nodes, extract it, and then execute a binary might look like:
---
- hosts: computenodes
  user: myname
  vars:
    num_procs: 32
  tasks:
    - name: copy and extract tarball to deployment location
      action: unarchive src=myapp.tar.gz dest=/path/to/deploy/
    - name: execute app
      action: command mpirun -np {{num_procs}} /path/to/deploy/myapp.exe

rsync --sparse does transfer whole data

I have some VM images that need to be synced every day. The VM files are sparse.
To save network traffic, I only want to transfer the real data of the images.
I tried the --sparse option of rsync, but in the network traffic I see that the whole size gets transferred over the network and not only the real data usage.
If I use rsync -zv --sparse, then only the real size gets transmitted over the network and everything is OK. But I don't want to compress the file because of the CPU usage.
Shouldn't the --sparse option transfer only the real data, with the "null data" created locally to save network traffic?
Is there a workaround without compression?
Thanks!
Take a look at this discussion, specifically, this answer.
It seems that the solution is to do a rsync --sparse followed by a rsync --inplace.
On the first, --sparse, call, also use --ignore-existing to prevent already transferred sparse files from being overwritten, and -z to save network resources.
The second call, --inplace, should update only modified chunks. Here, compression is optional.
Also see this post.
Update
I believe the suggestions above won't solve your problem. I also believe that rsync is not the right tool for the task. You should search for other tools which will give you a good balance between network and disk I/O efficiency.
Rsync was designed for efficient usage of a single resource, the network. It assumes reading and writing to the network is much more expensive than reading and writing the source and destination files.
"We assume that the two machines are connected by a low-bandwidth high-latency bi-directional communications link." (The rsync algorithm, abstract.)
The algorithm, summarized in four steps:
1. The receiving side β sends checksums of blocks of size S of the destination file B.
2. The sending side α identifies blocks that match in the source file A, at any offset.
3. α sends β a list of instructions made of either verbatim, non-matching data, or matching block references.
4. β reconstructs the whole file from those instructions.
Notice that rsync normally reconstructs the file B as a temporary file T, then replaces B with T. In this case it must write the whole file.
The --inplace does not relieve rsync from writing blocks matched by α, as one could imagine. They can match at different offsets. Scanning B a second time to take new data checksums is prohibitive in terms of performance. A block that matches in the same offset it was read on step one could be skipped, but rsync does not do that. In the case of a sparse file, a null block of B would match for every null block of A, and would have to be rewritten.
The --inplace just causes rsync to write directly to B, instead of T. It will rewrite the whole file.
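To make the block-checksum idea more concrete, here is a deliberately simplified sketch that checksums fixed-size blocks of two local files and reports which block numbers differ. Unlike rsync, it only compares blocks at the same offset (rsync's rolling checksum finds matches at any offset), and the function names `block_sums` and `changed_blocks` are invented for this example.

```shell
#!/bin/sh
# Simplified take on the block-checksum idea; NOT rsync's algorithm.
# It compares blocks at the SAME offset only.

# Print one "crc size" line per BLOCK-byte block of FILE.
block_sums() {
    file=$1
    block=$2
    size=$(wc -c < "$file")
    count=$(( (size + block - 1) / block ))
    k=0
    while [ "$k" -lt "$count" ]; do
        dd if="$file" bs="$block" skip="$k" count=1 2>/dev/null | cksum
        k=$((k + 1))
    done
}

# changed_blocks A B BLOCK: print 0-based numbers of blocks that differ.
changed_blocks() {
    a_sums=$(mktemp)
    b_sums=$(mktemp)
    block_sums "$1" "$3" > "$a_sums"
    block_sums "$2" "$3" > "$b_sums"
    # paste puts the two checksum lists side by side, one row per block.
    paste -d '|' "$a_sums" "$b_sums" |
        awk -F '|' '$1 != $2 { print NR - 1 }'
    rm -f "$a_sums" "$b_sums"
}
```

With a sparse image, every all-zero block produces the same checksum, which is why the matching step alone does not tell the receiver that a block can be left as a hole.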
The latest version of rsync can handle --sparse and --inplace together! I found the following github entry from 2016: https://github.com/tuna/rsync/commit/f3873b3d88b61167b106e7b9227a20147f8f6197
You could try changing the compression level to the lowest value (use the option --compress-level=1). The lowest compression level seems to be enough to reduce the traffic for sparse files, but I don't know how the CPU usage is affected.

Sender and receiver to transfer files over ssh on request?

I created a program that iterates over a bunch of files and invokes for some of them:
scp <file> user@host:<remotefile>
However, in my case, there may be thousands of small files that need to be transferred, and scp opens a new ssh connection for each of them, which has quite some overhead.
I was wondering if there is no solution where I keep one process running that maintains the connection and I can send it "requests" to copy over single files.
Ideally, I'm looking for a combination of some sender and receiver program, such that I can start a single process (1) at the beginning:
ssh user@host receiverprogram
And for each file, I invoke a command (2):
senderprogram <file> <remotefile>
and pipe the output of (2) to the input of (1), and this would cause the file to be transferred. In the end, I can just send process (1) some signal to terminate.
Preferably the sender and receiver programs are open source C programs for Unix. They may communicate using a socket instead of a pipe, or any other creative solution.
However, it is an important constraint that each file gets transferred at the moment I iterate over it: it is not acceptable to collect a list of files and then invoke one instance of scp to transfer all the files at once at the end. Also, I have only simple shell access to the receiving host.
Update: I found a solution for the problem of the connection overhead using the multiplexing features of ssh, see my own answer below. Yet, I'm starting a bounty because I'm curious to find if there exists a sender/receiver program as I describe here. It seems there should exist something that can be used, e.g. xmodem/ymodem/zmodem?
I found a solution from another angle. Since version 3.9, OpenSSH supports session multiplexing: a single connection can carry multiple login or file transfer sessions. This avoids the set-up cost per connection.
For the case of the question, I can first open a connection that sets up a control master (-M) with a socket (-S) in a specific location. I don't need a session (-N).
ssh user@host -M -S /tmp/%r@%h:%p -N
Next, I can invoke scp for each file and instruct it to use the same socket:
scp -o 'ControlPath /tmp/%r@%h:%p' <file> user@host:<remotefile>
This command starts copying almost instantaneously!
You can also use the control socket for normal ssh connections, which will then open immediately:
ssh user@host -S /tmp/%r@%h:%p
If the control socket is no longer available (e.g. because you killed the master), this falls back to a normal connection. More information is available in this article.
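Put together, the master/scp/shutdown sequence could be wrapped in a few helper functions. This is only a sketch: `REMOTE`, `CTL_SOCKET`, `run` and the three function names are placeholders I made up, and `DRY_RUN=1` is a convenience of this script (it prints the commands instead of running them), not an ssh feature.

```shell
#!/bin/sh
# Sketch: one master connection, many scp calls, explicit shutdown.
# REMOTE and CTL_SOCKET are placeholders; adjust them for a real host.

REMOTE=${REMOTE:-user@host}
CTL_SOCKET=${CTL_SOCKET:-/tmp/%r@%h:%p}

# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# -M: act as control master, -N: no remote command, -f: go to background.
open_master()  { run ssh -M -S "$CTL_SOCKET" -N -f "$REMOTE"; }

# Reuse the master's socket for each individual file.
send_file()    { run scp -o "ControlPath=$CTL_SOCKET" "$1" "$REMOTE:$2"; }

# Ask the master to exit once all files have been sent.
close_master() { run ssh -S "$CTL_SOCKET" -O exit "$REMOTE"; }
```

The program that iterates over the files would call `open_master` once, `send_file` for each file as it is encountered, and `close_master` at the end.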
This way would work, and for other things, this general approach is more or less right.
(
  iterate over file list
    for each matching file
      echo filename
) | cpio -H newc -o | ssh remotehost cd location \&\& cpio -H newc -imud
It might work to use sftp instead of scp, and to place it into batch mode. Make the batch command file a pipe or UNIX domain socket and feed commands to it as you want them executed.
Security on this might be a little tricky at the client end.
Have you tried sshfs?
You could:
sshfs remote_user@remote_host:/remote_dir /mnt/local_dir
Where
/remote_dir was the directory you want to send files to on the system you are sshing into
/mnt/local_dir was the local mount location
With this setup you can just cp a file into the local_dir and it would be sent over sftp to remote_host in its remote_dir
Note that there is a single connection, so there is little in the way of overhead
You may need to use the flag -o ServerAliveInterval=15 to maintain an indefinite connection
You will need to have fuse installed locally and an SSH server supporting (and configured for) sftp
Maybe you are looking for this:
ZSSH
zssh (Zmodem SSH) is a program for interactively transferring files to a remote machine while using the secure shell (ssh). It is intended to be a convenient alternative to scp , allowing to transfer files without having to open another session and re-authenticate oneself.
Use rsync over ssh if you can collect all the files to send in a single directory (or hierarchy of directories).
If you don't have all the files in a single place, please give some more information as to what you want to achieve and why you can't pack all the files into an archive and send that over. Why is it so vital that each file is sent immediately? Would it be OK if the file was sent with a short delay (like when 4K worth of data has accumulated)?
It's a nice little problem. I'm not aware of a prepackaged solution, but you could do a lot with simple shell scripts. I'd try this at the receiver:
#!/bin/ksh
# this is receiverprogram
while true
do
typeset -i length
read filename # read filename sent by sender below
read size # read size of file sent
read -N $size contents # read all the bytes of the file
print -n "$contents" > "$filename"
done
At the sender side I would create a named pipe and read from the pipe, e.g.,
mkfifo $HOME/my-connection
ssh remotehost receiver-script < $HOME/my-connection
Then to send a file I'd try this script
#!/bin/ksh
# this is senderprogram
FIFO=$HOME/my-connection
localname="$1"
remotename="$2"
print "$remotename" > $FIFO
size=$(stat -c %s "$localname")
print "$size" > $FIFO
cat "$localname" > $FIFO
If the file size is large you probably don't want to read it at one go, so something on the order of
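BUFSIZ=8192
rm -f "$filename"
# copy full-size chunks while at least BUFSIZ bytes remain
while ((size >= BUFSIZ)); do
    read -N $BUFSIZ buffer
    print -n "$buffer" >> "$filename"
    size=$((size - BUFSIZ))
done
# copy the final, shorter chunk (if any)
if ((size > 0)); then
    read -N $size buffer
    print -n "$buffer" >> "$filename"
fi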
Eventually you'll want to extend the script so you can pass through chmod and chgrp commands. Since you trust the sending code, it's probably easiest to structure the thing so that the receiver simply calls shell eval on each line, then send stuff like
print filename='"'"$remotename"'"' > $FIFO
print "read_and_copy_bytes " '$filename' "$size" > $FIFO
and then define a local function read_and_copy_bytes. Getting the quoting right is a bear, but otherwise it should be straightforward.
Of course, none of this has been tested! But I hope it gives you some useful ideas.
Seems like a job for tar? Pipe its output to ssh, and on the other side pipe the ssh output back to tar.
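A minimal sketch of that tar pipe is below. Both ends are local here so that it can be tried safely; `tar_stream_copy` is a made-up name, and for a remote target the second tar would run under ssh, as noted in the comment.

```shell
#!/bin/sh
# Stream a directory tree from SRC into DST through a pipe, with no
# intermediate archive written to disk. For a remote target, the
# extracting side would run under ssh instead, e.g.:
#   tar -C "$src" -cf - . | ssh user@host 'tar -C /remote/dir -xf -'
tar_stream_copy() {
    src=$1
    dst=$2
    mkdir -p "$dst" || return 1
    tar -C "$src" -cf - . | tar -C "$dst" -xf -
}
```

Note that this sends a whole tree in one stream, so by itself it does not satisfy the question's requirement that each file is transferred at the moment it is iterated over.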
I think that the GNOME desktop uses a single SSH connection when accessing a share through SFTP (SSH). I'm guessing that this is what's happening because I see a single SSH process when I access a remote share this way. So if this is true you should be able to use the same program for this purpose.
The new version of GNOME uses GVFS through GIO to perform all kinds of I/O through different backends. The Ubuntu package gvfs-bin provides various command line utilities that let you manipulate the backends from the command line.
First you will need to mount your SSH folder:
gvfs-mount sftp://user@host/
And then you can use the gvfs-copy to copy your files. I think that all file transfers will be performed through a single SSH process. You can even use ps to see which process is being used.
If you feel more adventurous you can even write your own program in C or in some other high level language that provides an API to GIO.
One option is Conch, an SSH client and server implementation written in Python using the Twisted framework. You could use it to write a tool which accepts requests via some other protocol (HTTP or Unix domain sockets, FTP, SSH or whatever) and triggers file transfers over a long running SSH connection. In fact, I have several programs in production which use this technique to avoid multiple SSH connection setups.
There was a very similar question here a couple of weeks ago. The accepted answer proposed to open a tunnel when ssh'ing to the remote machine and to use that tunnel for scp transfers.
Perhaps CurlFTPFS might be a valid solution for you.
It looks like it just mounts an external computer's folder to your computer via SFTP. Once that's done, you should be able to use your regular cp commands and everything will be done securely.
Unfortunately I was not able to test it out myself, but let me know if it works for ya!
Edit 1: I have been able to download and test it. As I feared, it does require that the client have an FTP server. However, I have found another program which has exactly the same concept as what you are looking for. sshfs allows you to connect to your client computer without needing any special server. Once you have mounted one of their folders, you can use your normal cp commands to move whatever files you need to move. Once you are done, it should then be a simple matter of umount /path/to/mounted/folder. Let me know how this works out!
rsync -avlzp user@remotemachine:/path/to/files /path/to/this/folder
This will use SSH to transfer files, in a non-slow way
Keep it simple, write a little wrapper script that does something like this.
tar the files
send the tar-file
untar on the other side
Something like this:
tar -cvzf test.tgz files ....
scp test.tgz user@other.site.com:.
ssh user@other.site.com tar -xzvf test.tgz
/Johan

Resources