Any way to make rsync-based transfers faster using a staging directory?

I rsync between two systems using ssh. The command line that I use is something like:
rsync -avzh -e "ssh" ./builds/2023_02_02 akamaicdn:builds/2023_02_02
The akamaicdn host is accessed over ssh; the corresponding identity, host name, etc. are specified in the ~/.ssh/config file.
Most of the time the destination directory doesn't exist yet. That means each run is a full upload, apart from whatever optimizations rsync can still apply.
But the content uploaded the next day has a lot in common with the previous day's, since these are build directories that share much of their content.
Is there any way to tell the remote rsync to scan a set of previous directories when it determines which parts of a file have to be uploaded?
I am open to other optimizations if you can think of any.
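One option that may be worth trying, sketched here under assumptions: rsync's --link-dest option (with --copy-dest and --compare-dest as variants) names an extra directory on the receiving side to check for existing files. Files identical to the previous day's are hard-linked instead of uploaded, and per the rsync man page a non-identical file found there can be picked as the basis file for the delta transfer. The helper name upload_build and the previous-day directory 2023_02_01 are hypothetical; a relative --link-dest path is resolved against the destination directory.

```shell
#!/bin/sh
# upload_build SRC_DIR DEST PREV_NAME   (hypothetical helper, not part of rsync)
#   SRC_DIR   local build directory
#   DEST      destination directory, e.g. akamaicdn:builds/2023_02_02
#   PREV_NAME name of an older build that already sits next to DEST
upload_build() {
    src=$1
    dest=$2
    prev=$3
    # Trailing slashes copy the *contents* of SRC_DIR into DEST.
    # --link-dest is relative to DEST, so "../$prev" is DEST's sibling.
    rsync -avzh -e ssh --link-dest="../$prev" "$src/" "$dest/"
}

# Example call (assumes yesterday's build is already on the remote side):
# upload_build ./builds/2023_02_02 akamaicdn:builds/2023_02_02 2023_02_01
```

If hard links between build directories are not wanted on the remote side, --copy-dest takes the same argument and makes local copies there instead.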

Related

Replacing static files that are under heavy read load

Let us assume we have a static file server (Nginx + Linux) that serves 10 files. The files are read almost as frequently as the server can process requests. However, some of the files need to be replaced with new versions, with the filename and URL remaining unaltered. How can the files be replaced safely, without the risk that some reads fail or return a mix of two versions?
I understand this is a rather basic operating system matter and has something to do with renames, symlinks, and file sizes. However, I failed to find a clear reference or a good discussion and I hope we can build one here.
Use rsync. Typically I choose rsync -av src dst, but YMMV.
What is terrific about rsync is that, in addition to costing essentially nothing when little or nothing has changed, it uses an atomic rename. During the file transfer, a ".fooNNNNN" temp file gets bigger and bigger. Once the transfer completes, rsync closes the file and renames it on top of "foo". So web clients see either all of the old file or all of the new file.
Notice that range downloads (say, from a restart after an error) are not atomic, which exposes such clients to corruption, especially if bytes were inserted near the beginning of the file. A SHA1 check wouldn't validate for such a client, which would then have to restart its download from scratch.
BTW, if these are "large" files, tell nginx to use zero-copy sendfile().
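The same write-then-rename pattern can be done by hand when rsync is not involved. This is only a minimal sketch: replace_atomically is a hypothetical helper, and the 644 mode is an assumption about what the web server needs in order to read the file. The temp file is created in the target's own directory so that mv is a rename on one file system rather than a copy.

```shell
#!/bin/sh
# replace_atomically NEW_FILE TARGET   (hypothetical helper)
# Copies NEW_FILE to a hidden temp file next to TARGET, then renames it
# over TARGET, so readers see either the old or the new version in full.
replace_atomically() {
    new=$1
    target=$2
    dir=$(dirname "$target")
    # Same directory as the target => same file system => mv is a rename.
    tmp=$(mktemp "$dir/.$(basename "$target").XXXXXX") || return 1
    # mktemp creates the file with mode 600; 644 is an assumed final mode.
    if cp "$new" "$tmp" && chmod 644 "$tmp" && mv -f "$tmp" "$target"; then
        return 0
    fi
    rm -f "$tmp"   # clean up the temp file if any step failed
    return 1
}

# Example call (paths are placeholders):
# replace_atomically ./foo.new /var/www/static/foo
```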

rsync running at the same time on two machines

I would like to set up rsync to run on two machines.
Machine A rsync to Machine B
Machine B rsync to Machine A
Would there be any risk in running rsync on the two machines at the same time, syncing the same files?
If you're very careful you can do it:
Ensure that you use --times so that the time stamps always match, and --update so that it will not overwrite newer versions.
Use --temp-dir with a directory outside the synced tree so that rsync does not find any temp files during its scan. The temporary directory should be on the same file-system as the destination, however, or else atomic moves won't be possible.
Do not use --inplace, any variant of --delete, or --checksum.
Even then, I would always run it one way first and then the other, or use a tool like Syncthing.
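A rough sketch of the "first one way, then the other" approach with the options listed above. The helper name sync_newer, the host machineB, and the paths are all hypothetical; --temp-dir refers to a directory on the receiving side, which has to exist already and should be on the same file system as the destination.

```shell
#!/bin/sh
# sync_newer SRC DST TMP   (hypothetical helper)
# Copies from SRC to DST, keeping any file that is newer on the DST side.
sync_newer() {
    # -a includes --times; --update skips files that are newer on the receiver.
    # --temp-dir keeps rsync's temp files out of the synced tree.
    rsync -a --update --temp-dir="$3" "$1/" "$2/"
}

# Run from machine A, one direction after the other, never concurrently:
# sync_newer /data machineB:/data /var/tmp/rsync-tmp   # temp dir on B
# sync_newer machineB:/data /data /var/tmp/rsync-tmp   # temp dir on A
```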

synchronization over http: rsync versus normal upload

I'm running file synchronization over HTTP. Both sides implement rsync. When synchronizing, for uploading I have two choices:
use a simple post request if:
the file to be uploaded doesn't exist on the remote side.
the file exists and is bigger than a certain value M.
else : perform rsync over get requests.
My question is: how can I determine the best value of M?
I'm certain that above a certain file size, performing a simple upload is faster than performing the rsync steps, especially for multiple files.
Thanks
If you're using rsync correctly, I'd bet that it's always faster, especially with multiple files.
Rsync is specifically built to check differences between directory trees and update the target directory incrementally.
The following is a one-liner to keep in mind whenever you need to sync two directory trees.
rsync -av --delete /path/to/src /path/to/target
(also works over SSH, if necessary.)
Just keep in mind that rsync is picky about trailing slashes on directory paths.
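To illustrate the trailing-slash behaviour, here is a small local demo. The function name trailing_slash_demo and the directory names are made up for the example; it only needs a scratch directory to work in.

```shell
#!/bin/sh
# trailing_slash_demo ROOT   (hypothetical demo helper)
# Shows how a trailing slash on the source changes where files end up.
trailing_slash_demo() {
    root=$1
    mkdir -p "$root/src" "$root/with_slash" "$root/without_slash"
    echo data > "$root/src/file.txt"

    # Source WITH trailing slash: copies the contents of src.
    rsync -a "$root/src/" "$root/with_slash/"      # -> with_slash/file.txt

    # Source WITHOUT trailing slash: copies the directory itself.
    rsync -a "$root/src" "$root/without_slash/"    # -> without_slash/src/file.txt
}

# Example call:
# trailing_slash_demo "$(mktemp -d)"
```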

How do I scp a file to a Unix host so that a file polling service won't see it before the copy is complete?

I am trying to transfer a file to a remote Unix server using scp. On that server, there is a service which polls the target directory to detect incoming files for processing. I would like to ensure that the polling service does not pick up new files before the copy is complete. Is there a way of doing that?
My file transfer process is a simple scp command embedded in a larger Java program. Ideally, a solution which did not involve changing the Java would be best (for reasons involving change control processes).
You can scp the file to a different directory (such as /tmp) and move the file via ssh after the transfer is complete. The different directory needs to be on the same partition as the final destination directory; otherwise the move becomes a copy operation and you'll face a similar problem. Another service on the destination machine can do this move operation.
You can copy the file under a hidden name (prefix the filename with .) and then rename it once the copy is complete.
If you can modify the polling service, you can check for active scp processes and ignore files matching the scp arguments.
You can check for open files with lsof +d $directory and ignore them in the polling service.
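The hidden-name variant could look roughly like the sketch below. It assumes that the polling service ignores dot-files, which you would need to verify; upload_then_rename, the ".part" suffix, and the host and paths in the example call are hypothetical. Uploading into the target directory itself keeps the final mv on one file system, so it is a rename rather than a copy.

```shell
#!/bin/sh
# upload_then_rename FILE HOST DIR   (hypothetical helper)
# Uploads FILE under a hidden temporary name, then renames it into place.
# Assumes the polling service skips names that start with a dot.
upload_then_rename() {
    file=$1
    host=$2
    dir=$3
    name=$(basename "$file")
    # Step 1: upload under a hidden name inside the target directory.
    # Step 2: rename on the remote side once the upload has finished.
    # Note: the quoting breaks if DIR or the file name contains a single quote.
    scp "$file" "$host:$dir/.$name.part" &&
        ssh "$host" "mv '$dir/.$name.part' '$dir/$name'"
}

# Example call (host and paths are placeholders):
# upload_then_rename ./report.csv remotehost /target/path
```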
I suggest copying the file using rsync instead of scp. rsync already copies new files to temporary filenames, and has many other useful features for file synchronization as well.
$ rsync -a source/path/ remotehost:/target/path/
Of course, you can also copy file-by-file if that's your preference.
If rsync's temporary filenames are sufficient to avoid being picked up by your polling service, then you could simply replace your scp command with a shell script that acts as a wrapper for rsync, eliminating the need to change your Java program.
You would need to know the precise format that your Java program uses to call the scp command, to make sure that the options you feed to rsync do what you expect.
You would also need to figure out how your Java program calls scp. If it does so by full pathname (i.e. /usr/bin/scp), then this solution might put other things at risk on your system that depend on scp (like you, for example, expecting scp to behave as it usually does instead of as a wrapper). Changing a package-installed binary like /usr/bin/scp may also "break" your package registration, making it difficult to install future security updates because a binary has changed to a shell script. And of course, there might be security implications to any change you make.
All in all, I suspect you're better off changing your Java program to make it do precisely what you want, even if that is to launch a shell script to handle aspects of automation that you want to be able to change in the future without modifying your Java.
Good luck!

rsync list of specific local files in 1 step

I'm working on a web application where a user uploads a list of files, which should then be immediately rsynced to a remote server. I have a list of all the local files that need to be rsynced, but they will be mixed in with other files that I do not want rsynced every time. I know rsync will only send the changed files, but this directory structure and contents will grow very large over time and the delay would not be acceptable.
I know that doing a remote rsync, I can specify a list of remote files, i.e...
rsync "host:/path/to/file1 /path/to/file2 /path/to/file3"
... but that does not work once I remove "host:" and try to specify the files locally.
I also know I can use --files-from, but that would require me to create a file ahead of time with a list of files that I want to rsync (and then delete it afterwards). I think it'd be cleaner to just effectively say "rsync these 4 specific files to this remote server", but I can't seem to get that to work.
Is there any way to do what I'm trying to accomplish, or do I have to resort to creating a tmp file with a list in it?
Thanks!
You should be able to list the files similar to the example you gave. I did this on my machine to copy 2 specific files from a directory with many other files present.
rsync test.sql test2.cpp myUser@myHost:path/to/files/synced/
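If the list gets long, or you would rather not build one big command line, --files-from also accepts - to read the list from standard input, so no temporary list file is needed. A minimal sketch, where send_files is a hypothetical helper and the file names are taken relative to the source directory:

```shell
#!/bin/sh
# send_files SRC_DIR DEST FILE...   (hypothetical helper)
# Sends only the named files (paths relative to SRC_DIR) to DEST,
# feeding the list to rsync on stdin instead of via a temp file.
send_files() {
    src_dir=$1
    dest=$2
    shift 2
    # One file name per line; --files-from=- reads the list from stdin.
    printf '%s\n' "$@" | rsync -a --files-from=- "$src_dir" "$dest"
}

# Example call (host and paths are placeholders):
# send_files /var/uploads myUser@myHost:path/to/files/synced/ test.sql test2.cpp
```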
