rsync --sparse transfers the whole data - rsync

I have some VM images that need to be synced every day. The VM files are sparse.
To save network traffic I only want to transfer the real data of the images.
I tried the --sparse option of rsync, but looking at the network traffic I see that the whole size gets transferred over the network, not only the real data usage.
If I use rsync -zv --sparse then only the real size gets transmitted over the network and everything is OK. But I don't want to compress the files because of the CPU usage.
Shouldn't the --sparse option transfer only the real data, with the "null data" created locally to save network traffic?
Is there a workaround without compression?
Thanks!

Take a look at this discussion, specifically, this answer.
It seems that the solution is to do a rsync --sparse followed by a rsync --inplace.
On the first, --sparse, call, also use --ignore-existing to prevent already transferred sparse files from being overwritten, and -z to save network resources.
The second call, --inplace, should update only modified chunks. Here, compression is optional.
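As a rough illustration, the two calls could be scripted like this from Python (the paths and host below are made-up placeholders, and the flags are only the ones discussed above):

import subprocess

SRC = "/var/lib/libvirt/images/"        # hypothetical VM image directory
DEST = "backuphost:/srv/vm-images/"     # hypothetical rsync destination

# Pass 1: files that do not exist on the receiver yet. --sparse recreates
# the holes on the destination, -z keeps the null runs off the wire.
subprocess.run(["rsync", "-av", "--sparse", "--ignore-existing", "-z", SRC, DEST],
               check=True)

# Pass 2: files that already exist. --inplace updates changed blocks directly
# in the destination file instead of rebuilding a temporary copy.
subprocess.run(["rsync", "-av", "--inplace", SRC, DEST], check=True)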
Also see this post.
Update
I believe the suggestions above won't solve your problem. I also believe that rsync is not the right tool for the task. You should search for other tools which will give you a good balance between network and disk I/O efficiency.
Rsync was designed for efficient usage of a single resource, the network. It assumes reading and writing to the network is much more expensive than reading and writing the source and destination files.
"We assume that the two machines are connected by a low-bandwidth high-latency bi-directional communications link." (The rsync algorithm, abstract.)
The algorithm, summarized in four steps:
The receiving side β sends checksums of blocks of size S of the destination file B.
The sending side α identifies blocks that match in the source file A, at any offset.
α sends β a list of instructions consisting of either verbatim (non-matching) data or references to matching blocks.
β reconstructs the whole file from those instructions.
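A toy Python illustration of these four steps (block size and hash are arbitrary choices; real rsync pairs a weak rolling checksum with a strong one so the per-offset scan stays cheap):

import hashlib

BLOCK = 4096  # block size S; arbitrary for this sketch

def block_signatures(old_data):
    # Step 1 (receiver β): checksum every block of the destination file B.
    return {hashlib.md5(old_data[i:i + BLOCK]).digest(): i
            for i in range(0, len(old_data), BLOCK)}

def delta(new_data, signatures):
    # Steps 2-3 (sender α): scan the source file A at every offset and emit
    # either ("ref", old_offset) for a block the receiver already has, or
    # ("lit", byte) for data that must travel verbatim. This sketch recomputes
    # the strong hash at each offset purely for clarity.
    instructions, i = [], 0
    while i < len(new_data):
        digest = hashlib.md5(new_data[i:i + BLOCK]).digest()
        if digest in signatures:
            instructions.append(("ref", signatures[digest]))
            i += BLOCK
        else:
            instructions.append(("lit", new_data[i:i + 1]))
            i += 1
    return instructions

def reconstruct(old_data, instructions):
    # Step 4 (receiver β): rebuild the whole file from references and literals.
    parts = []
    for kind, value in instructions:
        parts.append(old_data[value:value + BLOCK] if kind == "ref" else value)
    return b"".join(parts)

For any pair of files, reconstruct(B, delta(A, block_signatures(B))) yields the content of A; only the literal bytes plus short block references would have to cross the network.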
Notice that rsync normally reconstructs the file B as a temporary file T, then replaces B with T. In this case it must write the whole file.
The --inplace option does not relieve rsync from writing blocks matched by α, as one might imagine. They can match at different offsets. Scanning B a second time to take new data checksums would be prohibitive in terms of performance. A block that matches at the same offset it was read from in step one could be skipped, but rsync does not do that. In the case of a sparse file, a null block of B would match every null block of A, and would have to be rewritten.
The --inplace option just causes rsync to write directly to B instead of T. It still rewrites the whole file.

The latest version of rsync can handle --sparse and --inplace together! I found the following github entry from 2016: https://github.com/tuna/rsync/commit/f3873b3d88b61167b106e7b9227a20147f8f6197

You could try changing the compression level to the lowest value (use the option --compress-level=1). The lowest compression level seems to be enough to reduce the traffic for sparse files, but I don't know how the CPU usage is affected.

Related

Transfer function that saves progress?

Is there a tool similar to scp where, if the connection is lost, progress is saved and resuming the process picks up where it left off? I am trying to scp a large file, and my VPN connection keeps cutting out.
Use rsync --partial. It will keep partially transferred files, which you can then resume with the same invocation. From the rsync man page:
--partial
By default, rsync will delete any partially transferred file if the transfer is interrupted. In some circumstances it is more desirable to keep partially transferred files. Using the --partial option tells rsync to keep the partial file which should make a subsequent transfer of the rest of the file much faster.
Try something like rsync -aivz --partial user@host:/path/to/file ~/destination/folder/
Explanation of the other switches:
a — "archive mode": make transfer recursive; preserve symlinks, permissions, timestamps, group, owner; and (where possible) preserve special and device files
i — "itemize changes": shows you what exactly is getting changed (it will be a string of all + signs if you're copying a file anew +++++++)
v — "verbose": list files as they're transferred
z — "zip": compress data during transfer
Those are just the ones I usually use to transfer files. You can see a list of all options by looking at the rsync man page.
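If the connection keeps dropping, "resume with the same invocation" can also be scripted; a minimal Python sketch, assuming the same placeholder remote path as above:

import subprocess
import time

# --partial keeps the partly transferred file, so every retry of the same
# command continues where the previous attempt stopped.
CMD = ["rsync", "-aivz", "--partial", "user@host:/path/to/file", "/home/me/destination/"]

while subprocess.run(CMD).returncode != 0:
    time.sleep(30)   # connection dropped; wait a bit, then resume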

MPI one-sided file I/O

I have some questions on performing File I/Os using MPI.
A set of files are distributed across different processes.
I want the processes to read the files in the other processes.
For example, in one-sided communication, each process sets up a window visible to the other processes. I need exactly the same functionality. (Create 'windows' for all files and share them so that any process can read any file from any offset.)
Is it possible in MPI? I have read lots of documentation about MPI, but couldn't find exactly this.
The simple answer is that you can't do that automatically with MPI.
You can convince yourself by seeing that MPI_File_open() is a collective call taking an intra-communicator as its first argument and returning a file handle to the opened file as its last argument. In this communicator, all processes open the file and therefore all processes must see the file. So unless a process sees a file, it cannot get an MPI_File handle to access it.
Now, that doesn't mean there's no solution. A possibility could be to do by hand exactly what you described, namely:
Each MPI process individually opens the file it sees and is responsible for; then
Each of these processes reads its local file into a buffer;
These individual buffers are all exposed, using either one global MPI_Win memory window or several individual ones, ready for one-sided read accesses; and finally
All read accesses to any data that was previously stored in these individual local files are now done through MPI_Get() calls using the memory window(s).
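Here is a minimal mpi4py sketch of that scheme; the file naming, the 128-byte read and the assumption that every rank owns exactly one file are all made up for illustration:

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Each rank reads the one local file it is responsible for into memory.
# "data_<rank>.bin" is a hypothetical naming scheme.
with open("data_%d.bin" % rank, "rb") as f:
    local = bytearray(f.read())

# Expose the buffer through a one-sided window (collective call).
win = MPI.Win.Create(local, disp_unit=1, comm=comm)

# Example one-sided read: every rank fetches the first 128 bytes of the
# file owned by rank 0 (assumes that file is at least 128 bytes long).
owner, offset, count = 0, 0, 128
chunk = bytearray(count)
win.Lock(owner, MPI.LOCK_SHARED)
win.Get([chunk, MPI.BYTE], owner, target=(offset, count, MPI.BYTE))
win.Unlock(owner)

win.Free()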
The true limitation of this approach is that it requires fully reading all of the individual files, so you need to have sufficient memory per node to store each of them. I'm well aware that this is a very big caveat that could make the solution completely impractical. However, if the memory is sufficient, this is an easy approach.
Another even simpler solution would be to store the files into a shared file system, or having them all copied on all local file systems. I imagine this isn't an option since the question wouldn't have been asked otherwise...
Finally, as a last resort, a possibility I see would be to dedicate one MPI process (or an OpenMP thread of an MPI process) per node to serve the local files. This process would just act as a "file server", answering "read" requests coming from the other MPI processes and serving them by reading the requested data from the file and sending it back via MPI. It's a bit lengthy to write, but it should work.
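A rough mpi4py sketch of that last idea, with rank 0 serving a hypothetical data_0.bin; the message tags and the (offset, count) request layout are invented for the example:

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

READ, DONE = 1, 2                       # hypothetical message tags

if rank == 0:
    # "File server": answers (offset, count) read requests for its local file.
    with open("data_0.bin", "rb") as f:
        status = MPI.Status()
        clients = comm.Get_size() - 1
        while clients:
            msg = comm.recv(source=MPI.ANY_SOURCE, tag=MPI.ANY_TAG, status=status)
            if status.Get_tag() == DONE:
                clients -= 1
                continue
            offset, count = msg
            f.seek(offset)
            comm.send(f.read(count), dest=status.Get_source(), tag=READ)
else:
    # Client: request 64 bytes starting at offset 0, then sign off.
    comm.send((0, 64), dest=0, tag=READ)
    data = comm.recv(source=0, tag=READ)
    comm.send(None, dest=0, tag=DONE)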

How does `scp` differ from `rsync`?

An article about setting up Ghost blogging says to use scp to copy from my local machine to a remote server:
scp -r ghost-0.3 root@your-server-ip:~/
However, Railscast 339: Chef Solo Basics uses scp to copy in the opposite direction (from the remote server to the local machine):
scp -r root@178.xxx.xxx.xxx:/var/chef .
In the same Railscast, when the author wants to copy files to the remote server (same direction as the first example), he uses rsync:
rsync -r . root@178.xxx.xxx.xxx:/var/chef
Why use the rsync command if scp will copy in both directions? How does scp differ from rsync?
The major difference between these tools is how they copy files.
scp basically reads the source file and writes it to the destination. It performs a plain linear copy, locally, or over a network.
rsync also copies files locally or over a network. But it employs a special delta transfer algorithm and a few optimizations to make the operation a lot faster. Consider the call.
rsync A host:B
rsync will check file sizes and modification timestamps of both A and B, and skip any further processing if they match.
If the destination file B already exists, the delta transfer algorithm will make sure only differences between A and B are sent over the wire.
rsync will write data to a temporary file T, and then replace the destination file B with T to make the update look "atomic" to processes that might be using B.
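That last point is the usual "write a temporary file, then rename it over the destination" pattern; a small Python sketch of the idea (not rsync's actual code):

import os
import tempfile

def atomic_update(path, new_bytes):
    # Build the new content in a temporary file T in the same directory,
    # then rename it over the destination B, so other processes only ever
    # see either the complete old file or the complete new one.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(os.path.abspath(path)))
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(new_bytes)
        os.replace(tmp, path)        # atomic rename on POSIX and Windows
    except BaseException:
        os.unlink(tmp)
        raise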
Another difference between them concerns invocation. rsync has a plethora of command line options, allowing the user to fine tune its behavior. It supports complex filter rules, runs in batch mode, daemon mode, etc. scp has only a few switches.
In summary, use scp for your day to day tasks. Commands that you type once in a while on your interactive shell. It's simpler to use, and in those cases rsync optimizations won't help much.
For recurring tasks, like cron jobs, use rsync. As mentioned, on multiple invocations it will take advantage of data already transferred, performing very quickly and saving on resources. It is an excellent tool to keep two directories synchronized over a network.
Also, when dealing with large files, use rsync with the -P option. If the transfer is interrupted, you can resume it where it stopped by reissuing the command. See Sid Kshatriya's answer.
Finally, note that the rsync:// protocol is similar to plain HTTP: unencrypted and without integrity checks. Be sure to always use rsync via SSH (as in the examples from the question above), not via the rsync protocol, unless you really know what you're doing. scp always uses SSH as the underlying transport mechanism, which has both integrity and confidentiality guarantees, so that is another difference between the two utilities.
rsync can be useful to run on slow and unreliable connections. So if your download aborts in the middle of a large file, rsync will be able to continue where it left off when invoked again.
Use rsync -vP username@host:/path/to/file .
The -P option preserves partially downloaded files and also shows progress.
As usual check man rsync
Differences between scp and rsync on different parameters
1. Performance over latency
scp: scp is comparatively less optimized and slower
rsync: rsync is comparatively more optimized and faster
https://www.disk91.com/2014/technology/networks/compare-performance-of-different-file-transfer-protocol-over-latency/
2. Interruption handling
scp: the scp command line tool cannot resume aborted downloads after a lost network connection
rsync: if an rsync session gets interrupted, you can resume it as many times as you want by typing the same command. rsync will automatically restart the transfer where it left off.
http://ask.xmodulo.com/resume-large-scp-file-transfer-linux.html
3. Command Example
scp
$ scp source_file_path destination_file_path
rsync
$ cd /path/to/directory/of/partially_downloaded_file
$ rsync -P --rsh=ssh userid@remotehost.com:bigdata.tgz ./bigdata.tgz
The -P option is the same as --partial --progress, allowing rsync to work with partially downloaded files. The --rsh=ssh option tells rsync to use ssh as a remote shell.
4. Security
scp is more secure. You have to use rsync --rsh=ssh to make it as secure as scp.
Man pages for more details:
scp : http://www.manpagez.com/man/1/scp/
rsync : http://www.manpagez.com/man/1/rsync/
One major feature of rsync over scp (besides the delta algorithm, and encryption if used with ssh) is that it automatically verifies that the transferred file has been transferred correctly. scp will not do that, which occasionally might result in corruption when transferring larger files. So in general rsync is a copy with a guarantee.
The CentOS man pages mention this at the end of the --checksum option description:
Note that rsync always verifies that each transferred file was correctly reconstructed on the receiving side by checking a whole-file checksum that is generated as the file is transferred, but that automatic after-the-transfer verification has nothing to do with this option’s before-the-transfer “Does this file need to be updated?” check.
One distinction is that scp is always encrypted with ssh (secure shell), while rsync isn't necessarily encrypted. More specifically, rsync doesn't perform any encryption by itself; it's still capable of using other mechanisms (ssh, for example) to perform encryption.
In addition to security, encryption also has a major impact on your transfer speed, as well as the CPU overhead. (My experience is that rsync can be significantly faster than scp.)
Check out this post for when rsync has encryption on.
scp is best for one file, or for a combination of tar & compression for smaller data sets like source code trees with small resources (i.e. images, sqlite, etc.).
Yet, when you begin dealing with larger volumes, say:
media folders (40 GB)
database backups (28 GB)
mp3 libraries (100 GB)
It becomes impractical to build a zip/tar.gz file to transfer with scp at this point, due to the physical limits of the hosted server.
As an exercise, you can do some gymnastics like piping tar into ssh and redirecting the results into a remote file, saving the need to build a swap or temporary clone (aka zip or tar.gz); that pipe is sketched at the end of this answer.
However, rsync simplifies this process and allows you to transfer data without consuming any additional disk space.
Also, continuous (cron?) updates transfer minimal changes instead of full cloned copies, which speeds up large data migrations over time.
tl;dr
scp == small scale (with room to build compressed files on the same drive)
rsync == large scale (with the necessity to backup large data and no room left)
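For completeness, the tar-into-ssh pipe mentioned above looks roughly like this when driven from Python (host and paths are placeholders); the archive is streamed, never written to the local disk:

import subprocess

SRC_DIR = "media"                              # hypothetical 40 GB folder
REMOTE = "user@example.com"
REMOTE_FILE = "backups/media.tar.gz"

# tar writes the compressed archive to stdout, ssh forwards it straight
# into a file on the remote host; no temporary archive is created locally.
tar = subprocess.Popen(["tar", "czf", "-", SRC_DIR], stdout=subprocess.PIPE)
ssh = subprocess.Popen(["ssh", REMOTE, "cat > " + REMOTE_FILE], stdin=tar.stdout)
tar.stdout.close()          # let tar receive SIGPIPE if ssh exits early
ssh.wait()
tar.wait()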
It's better to think in a practical context. In our team, we use rsync -aP to replace a bad Cassandra host in our cluster. We can't do this with scp (slow, and no progress preservation).

What are the options for transferring 60GB+ files over a network?

I'm about to start developing an application to transfer very large files, without any rush but with a need for reliability. I would like people who have worked on such a case to give me an insight into what I'm about to get into.
The environment will be an intranet FTP server, so far using active FTP on normal ports, on Windows systems. I might also need to zip up the files before sending, and I remember working with a library once that would zip in memory, and there was a limit on the size... ideas on this would also be appreciated.
Let me know if I need to clarify something else. I'm asking for general/higher-level gotchas, if any, not really detailed help. I've done apps with normal sizes (up to 1GB) before, but for this one it seems I'd need to limit the speed so I don't kill the network, or things like that.
Thanks for any help.
I think you can get some inspiration from torrents.
Torrents generally break up the file into manageable pieces and calculate a hash of each of them. Later they transfer the file piece by piece. Each piece is verified against its hash and accepted only if it matches. This is a very effective mechanism: it lets the transfer happen from multiple sources and also lets it restart any number of times without worrying about corrupted data.
For a transfer from a server to a single client, I would suggest that you create a header which includes the metadata about the file, so the receiver always knows what to expect, knows how much has been received, and can also check the received data against the hashes.
I have practically implemented this idea in a client-server application, but the data size was much smaller, say 1500k, though reliability and redundancy were important factors. This way, you can also effectively control the amount of traffic you want to allow through your application.
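A small Python sketch of that piece/hash bookkeeping; the piece size and hash algorithm are arbitrary choices:

import hashlib

CHUNK = 4 * 1024 * 1024   # hypothetical 4 MB pieces

def manifest(path):
    # Sender: build the "header" described above, a list of per-piece hashes
    # the receiver can use to verify data and to know how much it already has.
    pieces = []
    with open(path, "rb") as f:
        while True:
            piece = f.read(CHUNK)
            if not piece:
                break
            pieces.append(hashlib.sha256(piece).hexdigest())
    return pieces

def verify(path, pieces):
    # Receiver: return the index of the first missing or corrupt piece,
    # i.e. where a resumed transfer should continue.
    with open(path, "rb") as f:
        for index, expected in enumerate(pieces):
            piece = f.read(CHUNK)
            if hashlib.sha256(piece).hexdigest() != expected:
                return index
    return len(pieces)

The sender ships the manifest first; after a reconnect the receiver runs verify() on what it already has and requests pieces from that index onward.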
I think the way to go is to use the rsync utility as an external process to Python -
Quoting from here:
…the pieces, using checksums, to possibly existing files in the target site, and transports only those pieces that are not found from the target site. In practice this means that if an older or partial version of a file to be copied already exists in the target site, rsync transports only the missing parts of the file. In many cases this makes the data update process much faster as all the files are not copied each time the source and target site get synchronized.
And you can use the -z switch to have compression on the fly for the data transfer, transparently; no need to bottle up either end compressing the whole file.
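Since the suggestion is to drive rsync as an external process from Python, the call itself can stay very small; a sketch with made-up source and destination:

import subprocess

SRC = "/data/huge-file.bin"                 # hypothetical 60 GB file
DEST = "user@target-host:/incoming/"

# -z compresses on the wire, --partial keeps an interrupted transfer on the
# receiver so the next run of the same command resumes instead of restarting.
subprocess.run(["rsync", "-avz", "--partial", SRC, DEST], check=True)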
Also, check the answers here:
https://serverfault.com/questions/154254/for-large-files-compress-first-then-transfer-or-rsync-z-which-would-be-fastest
And from rsync's man page, this might be of interest:
--partial
By default, rsync will delete any partially transferred file if the transfer is interrupted. In some circumstances it is more desirable to keep partially transferred files. Using the --partial option tells rsync to keep the partial file which should make a subsequent transfer of the rest of the file much faster.

Intercept outputs from a Program in Windows 7

I have an executable program which outputs data to the hard disk, e.g. C:\documents.
I need some means to intercept the data in Windows 7 before it gets to the hard drive. Then I will encrypt the data and send it on to the hard disk. Unfortunately, the .exe file does not support redirection, i.e. > at the command prompt. Do you know how I can achieve such a thing in any programming language (C, C++, Java, PHP)?
The encryption can only be done before the plain data is sent to the disk, not after.
Any ideas most welcome. Thanks.
This is virtually impossible in general. Many programs write to disk using memory-mapped files. In such a scheme, a memory range is mapped to (part of) a file, and writes to the file can't be distinguished from writes to memory. A statement like p[OFFSET_OF_FIELD_X] = 17; is logically a write to the file. Furthermore, the OS keeps track of the synchronization of memory and disk. Not all logical writes to memory are directly translated into physical writes to disk. From time to time, at the whim of the OS, dirty memory pages are copied back to disk.
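To see why there is nothing to hook, here is a small Python illustration of the memory-mapped case (the file name and offset are made up); the analogue of the p[OFFSET_OF_FIELD_X] = 17 statement above is an ordinary memory store:

import mmap

FIELD_X = 128                                # hypothetical field offset

with open("mapped.bin", "wb") as f:          # create a 4 KiB backing file
    f.write(b"\x00" * 4096)

with open("mapped.bin", "r+b") as f:
    view = mmap.mmap(f.fileno(), 4096)
    view[FIELD_X:FIELD_X + 1] = bytes([17])  # a plain memory store...
    view.flush()                             # ...which the OS persists to disk
    view.close()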
Even in the simpler case of CreateFile/WriteFile, there's little room to intercept the data on the fly. The closest you could get is to use Microsoft Detours. I know of at least one snake-oil encryption program (WxVault, crapware shipped on Dells) that does that. It repeatedly crashed my application in the field, which is why my program unpatches any attempt to intercept data on the fly. So not even such hacks are robust against programs that dislike interference.
