Transfer function that saves progress? - unix

Does a function exist similar to scp where if the connection is lost, then the progress is saved, and resuming the process picks up where it left off? I am trying to scp a large file, and my VPN connection keeps cutting out.

Use rsync --partial. It will keep partially transferred files, which you can then resume with the same invocation. From the rsync man page:
--partial
By default, rsync will delete any partially transferred file if the transfer is interrupted. In some circumstances it is
more desirable to keep partially transferred files. Using the --partial option tells rsync to keep the partial file which
should make a subsequent transfer of the rest of the file much faster.
Try something like rsync -aivz --partial user#host:/path/to/file ~/destination/folder/
Explanation of the other switches:
a — "archive mode": make transfer recursive; preserve symlinks, permissions, timestamps, group, owner; and (where possible) preserve special and device files
i — "itemize changes": shows you what exactly is getting changed (it will be a string of all + signs if you're copying a file anew +++++++)
v — "verbose": list files as they're transferred
z — "zip": compress data during transfer
Those are just the ones I usually use to transfer files. You can see a list of all options by looking at the rsync man page.

Related

Rsync hangs when transfering large files on an bad connection

I've been having problems transferring files over a pretty bad connection (I'm trying to upload a file on a cloud server) via rsync.
The rsync essentially hangs after about a minute or so. This is the way I'm trying to perform the upload:
rsync -avz --progress -e "ssh" ~/target.war root#my-remote-server:~/
There is no error message or other info. It simply hangs displaying something like:
7307264 14% 92.47kB/s 0:07:59
Ping-ing the remote endpoint doesn't seem to be loosing packages as far as I see.
Any help on how to overcome this problem would be nice. Thank you.
The --partial option keeps partially transferred files if the transfer is interrupted. You could use it to try again without having to transfer the whole file again.
In fact, there is the -P option, which is equivalent to --partial --progress. According to rsync's man page, "Its purpose is to make it much easier to specify these two options for a long transfer that may be interrupted."

How does `scp` differ from `rsync`?

An article about setting up Ghost blogging says to use scp to copy from my local machine to a remote server:
scp -r ghost-0.3 root#*your-server-ip*:~/
However, Railscast 339: Chef Solo Basics uses scp to copy in the opposite direction (from the remote server to the local machine):
scp -r root#178.xxx.xxx.xxx:/var/chef .
In the same Railscast, when the author wants to copy files to the remote server (same direction as the first example), he uses rsync:
rsync -r . root#178.xxx.xxx.xxx:/var/chef
Why use the rsync command if scp will copy in both directions? How does scp differ from rsync?
The major difference between these tools is how they copy files.
scp basically reads the source file and writes it to the destination. It performs a plain linear copy, locally, or over a network.
rsync also copies files locally or over a network. But it employs a special delta transfer algorithm and a few optimizations to make the operation a lot faster. Consider the call.
rsync A host:B
rsync will check files sizes and modification timestamps of both A and B, and skip any further processing if they match.
If the destination file B already exists, the delta transfer algorithm will make sure only differences between A and B are sent over the wire.
rsync will write data to a temporary file T, and then replace the destination file B with T to make the update look "atomic" to processes that might be using B.
Another difference between them concerns invocation. rsync has a plethora of command line options, allowing the user to fine tune its behavior. It supports complex filter rules, runs in batch mode, daemon mode, etc. scp has only a few switches.
In summary, use scp for your day to day tasks. Commands that you type once in a while on your interactive shell. It's simpler to use, and in those cases rsync optimizations won't help much.
For recurring tasks, like cron jobs, use rsync. As mentioned, on multiple invocations it will take advantage of data already transferred, performing very quickly and saving on resources. It is an excellent tool to keep two directories synchronized over a network.
Also, when dealing with large files, use rsync with the -P option. If the transfer is interrupted, you can resume it where it stopped by reissuing the command. See Sid Kshatriya's answer.
Finally, note that rsync:// the protocol is similar to plain HTTP: unencrypted and no integrity checks. Be sure to always use rsync via SSH (as in the examples from the question above), not via the rsync protocol, unless you really know what you're doing. scp will always use SSH as underlying transport mechanism which has both integrity and confidentiality guarantees, so that is another difference between the two utilities.
rysnc can be useful to run on slow and unreliable connections. So if your download aborts in the middle of a large file rysnc will be able to continue from where it left off when invoked again.
Use rsync -vP username#host:/path/to/file .
The -P option preserves partially downloaded files and also shows progress.
As usual check man rsync
Difference b/w scp and rsync on different parameter
1. Performance over latency
scp : scp is relatively less optimise and speed
rsync : rsync is comparatively more optimise and speed
https://www.disk91.com/2014/technology/networks/compare-performance-of-different-file-transfer-protocol-over-latency/
2. Interruption handling
scp : scp command line tool cannot resume aborted downloads from lost network connections
rsync : If the above rsync session itself gets interrupted, you can resume it as many time as you want by typing the same command. rsync will automatically restart the transfer where it left off.
http://ask.xmodulo.com/resume-large-scp-file-transfer-linux.html
3. Command Example
scp
$ scp source_file_path destination_file_path
rsync
$ cd /path/to/directory/of/partially_downloaded_file
$ rsync -P --rsh=ssh userid#remotehost.com:bigdata.tgz ./bigdata.tgz
The -P option is the same as --partial --progress, allowing rsync to work with partially downloaded files. The --rsh=ssh option tells rsync to use ssh as a remote shell.
4. Security :
scp is more secure. You have to use rsync --rsh=ssh to make it as secure as scp.
man document to know more :
scp : http://www.manpagez.com/man/1/scp/
rsync : http://www.manpagez.com/man/1/rsync/
One major feature of rsync over scp (beside the delta algorithm and encryption if used w/ ssh) is that it automatically verifies if the transferred file has been transferred correctly. Scp will not do that, which occasionally might result in corruption when transferring larger files. So in general rsync is a copy with guarantee.
Centos manpages mention this the end of the --checksum option description:
Note that rsync always verifies that each transferred file was
correctly reconstructed on the receiving side by checking a whole-file
checksum that is generated as the file is transferred, but that
automatic after-the-transfer verification has nothing to do with this
option’s before-the-transfer “Does this file need to be updated?”
check.
There's a distinction to me that scp is always encrypted with ssh (secure shell), while rsync isn't necessarily encrypted. More specifically, rsync doesn't perform any encryption by itself; it's still capable of using other mechanisms (ssh for example) to perform encryption.
In addition to security, encryption also has a major impact on your transfer speed, as well as the CPU overhead. (My experience is that rsync can be significantly faster than scp.)
Check out this post for when rsync has encryption on.
scp is best for one file.
OR a combination of tar & compression for smaller data sets
like source code trees with small resources (ie: images, sqlite etc).
Yet, when you begin dealing with larger volumes say:
media folders (40 GB)
database backups (28 GB)
mp3 libraries (100 GB)
It becomes impractical to build a zip/tar.gz file to transfer with scp at this point do to the physical limits of the hosted server.
As an exercise, you can do some gymnastics like piping tar into ssh and redirecting the results into a remote file. (saving the need to build
a swap or temporary clone aka zip or tar.gz)
However,
rsync simplify's this process and allows you to transfer data without consuming any additional disc space.
Also,
Continuous (cron?) updates use minimal changes vs full cloned copies speed
up large data migrations over time.
tl;dr
scp == small scale (with room to build compressed files on the same drive)
rsync == large scale (with the necessity to backup large data and no room left)
it's better to think in a practical context. In our team, we use rsync -aP to replace a bad cassandra host in our cluster. We can't do this with scp (slow and no progress preservation).

rsync --sparse does transfer whole data

I have some VM Images that need to synced everyday. The VM files are sparse'd.
To save network traffic i only want to transfer the real datas of the images.
I try it with --sparse option at rsync but on network traffic i see that the whole size get transfered over network and not only the real data usage.
If i use rsync -zv --sparse then only the real size get transmitted over network and everything is ok. But i dont want to compression the file because of the cpu usage.
Shouldnt the --sparse option transfer only real datas and the "null datas" get created locally to save network traffic?
Is there a workaround without compression?
Thanks!
Take a look a this discussion, specifically, this answer.
It seems that the solution is to do a rsync --sparse followed by a rsync --inplace.
On the first, --sparse, call, also use --ignore-existing to prevent already transferred sparse files to be overwritten, and -z to save network resources.
The second call, --inplace, should update only modified chunks. Here, compression is optional.
Also see this post.
Update
I believe the suggestions above won't solve your problem. I also believe that rsync is not the right tool for the task. You should search for other tools which will give you a good balance between network and disk I/O efficiency.
Rsync was designed for efficient usage of a single resource, the network. It assumes reading and writing to the network is much more expensive than reading and writing the source and destination files.
We assume that the two machines are connected by a low-bandwidth high-latency bi-directional communications link. The rsync algorithm, abstract.
The algorithm, summarized in four steps.
The receiving side β sends checksums of blocks of size S of the destination file B.
The sending side α identify blocks that match in the source file A, at any offset.
α sends β a list of instructions made of either verbatim, non-matching, data, or matching block references.
β reconstructs the whole file from those instructions.
Notice that rsync normally reconstructs the file B as a temporary file T, then replaces B with T. In this case it must write the whole file.
The --inplace does not relieve rsync from writing blocks matched by α, as one could imagine. They can match at different offsets. Scanning B a second time to take new data checksums is prohibitive in terms of performance. A block that matches in the same offset it was read on step one could be skipped, but rsync does not do that. In the case of a sparse file, a null block of B would match for every null block of A, and would have to be rewritten.
The --inplace just causes rsync to write directly to B, instead of T. It will rewrite the whole file.
The latest version of rsync can handle --sparse and --inplace together! I found the following github entry from 2016: https://github.com/tuna/rsync/commit/f3873b3d88b61167b106e7b9227a20147f8f6197
You could try to change the compression level to the lowest value (use the option --compress-level=1). The lowest compression level seems to be enough to reduce the traffic for sparse files. But I don't know, how the CPU usage is affected.

Could using -p "-partial" in rsync result in corrupted files?

I'm using a script which runs once an hour to rsync some files.
Sometimes the script is interrupted before it completes, e.g. because the client or server shut down.
For most files this is not an issue, as next time the script runs it will copy those.
However, for some large files that take a long time over LAN it may be interrupted before it completes. This means that rsync will have to start from scratch the next time but if it is interrupted again this second time etc then the files will never copy.
I'm therefore thinking of adding the -partial flag as described here:
Resuming rsync partial (-P/--partial) on a interrupted transfer
https://unix.stackexchange.com/questions/2445/resume-transfer-of-a-single-file-by-rsync
I have tested with "-partial" and it does work, i.e the operation continues from the last transferred file fragment.
My concern is whether this increases the risk of corrupted files?
https://lists.samba.org/archive/rsync/2002-August/003526.html
https://lists.samba.org/archive/rsync/2002-August/003525.html
Or to put it another way, even if "-partial" does create some corruption, then then next time rsync runs will it find and "correct" those corrupted blocks?
If that's the case then I'm OK to use "-partial" and in the case of any corruption it will simply be corrected the next time?
Thanks.
PS: I don't want to use "-c" as it creates too much hard disk activity.
From the rsync documentation on --checksum:
Note that rsync always verifies that each transferred file was correctly reconstructed on the receiving side by checking a whole-file checksum that is generated as the file is transferred, but that automatic after-the-transfer verification has nothing to do with this option’s before-the-transfer "Does this file need to be updated?" check.
So, yes, rsync will correct any partially transferred files, no matter how corrupted they are (in the worst case, all the data will be transferred again).

NFS sync vs async

I'm using NFS to allow two servers two communicate via simple text files, however sometimes it seems that the server reading the text files to get information is reading incomplete files, and then because of this crashes. Then I go to look at the "incomplete" file that made it crash, and the file is complete. Is it possible that the server reading these files is seeing them before they are complete written by NFS? I use linux's mv to move them from the local machine to NFS only when they are completely written, so there "should" never be an incomplete state on the NFS.
Could this problem have something to do with sync vs async? Right now I'm using async. From my understanding, async just means that you return from the write and your program can continue running, and this write will happen at a later time. Whereas sync means that your process will wait for that write to go through before it moves on. Would changing to sync fix this? Or is there a better way to handle this? I know two servers could communicate via a database, but I'm actually doing this to try to keep database usage down. Thanks!
mv across file-systems translates into cp+rm and is certainly not atomic, even without NFS involved. You should first copy the file to a temporary in the target file-system and then rename it to the correct name. For instance, instead of:
$ mv myfile.txt /mnt/targetfs/myfile.txt
do:
$ mv myfile.txt /mnt/targetfs/.myfile.txt.tmp
$ mv /mnt/targetfs/.myfile.txt.tmp /mnt/targetfs/myfile.txt
(This assumes that the process reading the file ignores it while it does not have the correct name.)

Resources