Replacing static files that are under heavy read load - nginx

Let us assume we have a static file server (Nginx + Linux) that serves 10 files. The files are read almost as frequently as the server can process. However, some of the files need to be replaced with new versions, so that the filename and URL address remain unaltered. How to replace the files safely without a fear that some reads fail or become a mix of two versions?
I understand this is a rather basic operating system matter and has something to do with renames, symlinks, and file sizes. However, I failed to find a clear reference or a good discussion and I hope we can build one here.

Use rsync. Typically I choose rsync -av src dst, but YMMV.
What is terrific about rsync is that, in addition to having essentially zero cost when little or nothing changed, it uses atomic rename. So during file transfer, a ".fooNNNNN" temp file gets bigger and bigger. Once completed, rsync closes the file and renames it on top of "foo". So web clients either see all of the old, or all of the new file. Notice that range downloads (say from restart after error) are not atomic, exposing such clients to lossage, especially if bytes were inserted near beginning of file. SHA1 wouldn't validate for such a client, and he would have to restart his download from scratch. BTW, if these are "large" files, tell nginx to use zero-copy sendfile().

Related

Stop rsync from backing up if too many files are being changed

Does anyone know of a way that I can tell rsync to not perform a backup if it detects and X amount of data will be changed? For example, if I run a backup and it detects and 25% of the data in the destination directory will be changed can I have it automatically abort that run and then I can evaluate and make a decision whether allow it or not. I back up my machine every night but what I'm worried about is if my machine gets hit with a ransomware bug or another issue that causes a ton of my data gets destroyed or lost I really don't want it to propagate to my backup. I used to tool call synconvery and it had this feature but I don't think the tool is supported very well and I get a lot of permission and read errors that I don't see with any other tools. Goodsync also has this feature but even though it runs on the Mac it doesn't support special characters in a file name and replaces it with an underscore when the file is copied. I just think that will cause problems when I try to restore those file and it's being referenced wit that special character but can't be found because it has a damn underscore. I like using rsync and I will eventually retrofit my script to use msrsync but I can't trust it if I can't get this protection in place.

What is the pro's and con's of locking the actual file vs an empty lock file?

My program is writing to a binary file, and there could be multiple instances of the program accessing the same binary file for the same user. In Unix/Linux, I see some programs (particularly daemon processes) locking an empty lock file instead of the actual shared data that needs to be locked (so instead of locking ~/.data/foo they lock ~/.data/foo.lck). What are the pros and cons of locking the actual file vs an empty lock file?
flock is not supported over NFS or other network file systems for all version of unix (it wasn't even supported by Linux until 2.6.12). On the other hand O_CREAT|O_EXCL is much more reliable over many more file systems, and has been so for much longer.
Even on systems that do support flock on network filesystems (or cases where you don't need that flexibility), O_CREAT|O_EXCL together with flock is very useful because it distinguishes between a clean shutdown and a non-clean shutdown. flock helpfully goes away automatically, but it also, unhelpfully, doesn't distinguish why it went away.
flocking the file itself prevents atomic writes (copy, erase old, rename), or any other case where you might erase the existing file. Sometimes "the actual file" doesn't always have the same inode over the entire run of the program. So a separate file is much more convenient in those cases as well. This is very common in those foo.lck cases, because often you're locking foo for a short period of time, and might erase it in the process.
I see three cons of an empty lock file:
The user permissions of the directory should allow you to create a file.
In case of disk space issues, this might fail.
In case your program crashes, the lockfile is still present.
I see one con of modifying the actual file's name:
In case your program crashes, your file has been altered (only the filename, but it might generate confusion).
Obviously, I see one big advantage of the empty lock file:
your original file does not change at all.
By the way, I believe this question is better suited for the SoftwareEngineering community.

Rsync - How to display only changed files

When my colleague and I upload a PHP web project to production, we use rsync for the file transfer with these arguments:
rsync -rltz --progress --stats --delete --perms --chmod=u=rwX,g=rwX,o=rX
When this runs, we see a long list of files that were changed.
Running this 2 times in a row, will always show the files that were changed between the 2 transfers.
However, when my colleague runs the same command after I did it, he will see a very long list of all files being changed (though the contents are identical) and this is extremely fast.
If he uploads again, then again there will be only minimal output.
So it seams to me that we get the correct output, only showing changes, but if someone else uploads from another computer, rsync will regard everything as changed.
I believe this may have something to do with the file permissions or times, but would like to know how to best solve this.
The idea is that we only see the changes, regardless who does the upload and in what order.
The huge file list is quite scary to see in a huge project, so we have no idea what was actually changed.
PS: We both deploy using the same user#server as target.
The t in your command says to copy the timestamps of the files, so if they don't match you'll see them get updated. If you think the timestamps on your two machines should match then the problem is something else.
The easiest way to ensure that the timestamps match would be to rsync them down from the server before making your edits.
Incidentally, having two people use rsync to update a production server seems error prone and fragile. You should consider putting your files in Git and pushing them to the server that way (you'd need a server-side hook to update the working copy for the web server to use).

How do I scp a file to a Unix host so that a file polling service won't see it before the copy is complete?

I am trying to transfer a file to a remote Unix server using scp. On that server, there is a service which polls the target directory to detect incoming files for processing. I would like to ensure that the polling service does not pick up new files before the copy is complete. Is there a way of doing that?
My file transfer process is a simple scp command embedded in a larger Java program. Ideally, a solution which did not involve changing the Jana would be best (for reasons involving change control processes).
You can scp the file to a different (/tmp) directory and move the
file via ssh after transfer is complete. The different directory needs to be on the same partition as the final destination directory otherwise there will be a copy operation and you'll face a similar problem. Another service on the destination machine can do this move operation.
You can copy the file as hidden (prefix the filename with .) and copy, then move
If you can modify the polling service, you can check active scp processes and ignore files matching scp arguments.
You can check for open files with lsof +d $directory and ignore them in the polling server
I suggest copying the file using rsync instead of scp. rsync already copies new files to temporary filenames, and has many other useful features for file synchronization as well.
$ rsync -a source/path/ remotehost:/target/path/
Of course, you can also copy file-by-file if that's your preference.
If rsync's temporary filenames are sufficient to avoid being picked up by your polling service, then you could simply replace your scp command with a shell script that acts as a wrapper for rsync, eliminating the need to change your Java program.
You would need to know the precise format that your Java program uses to call the scp command, to make sure that the options you feed to rsync do what you expect.
You would also need to figure out how your Java program calls scp. If it does so by full pathname (i.e. /usr/bin/scp), then this solution might put other things at risk on your system that depend on scp (like you, for example, expecting scp to behave as it usually does instead of as a wrapper). Changing a package-installed binary like /usr/bin/scp may also "break" your package registration, making it difficult to install future security updates because a binary has changed to a shell script. And of course, there might be security implications to any change you make.
All in all, I suspect you're better off changing your Java program to make it do precisely what you want, even if that is to launch a shell script to handle aspects of automation that you want to be able to change in the future without modifying your Java.
Good luck!

Strategy for handling user input as files

I'm creating a script to process files provided to us by our users. Everything happens within the same UNIX system (running on Solaris 10)
Right now our design is this
User places file into upload directory
Script placed on cron to run every 10 minutes.
Script looks for files in upload directory, processes them, deletes immediately afterward
For historical/legacy reasons, #1 can't change. Also, deleting the file after processing is a requirement.
My primary concern is concurrency. It is very likely that the situation will arise where the analysis script runs while an input file is still being written to. In this case, data will be lost and this (obviously) unacceptable.
Since we have no control over the user's chosen means of placing the input file, we cannot require them to obtain a file lock. As I understand, file locks are advisory only on UNIX. Therefore a user must choose to adhere to them.
I am looking for advice on best practices for handling this problem. Thanks
Obviously all the best solutions involve the client providing some kind of trigger indicating that it has finished uploading. That could be a second file, an atomic move of the file to a processing directory after writing it to a stage directory, or a REST web service. I will assume you have no control over your clients and are unable or unwilling to change anything about them.
In that case, you still have a few options:
You can use a pretty simple heuristic: check the file size, wait 5 seconds, check the file size. If it didn't change, it's probably good to go.
If you have super-user privileges, you can use lsof to determine if anyone has this file open for writing.
If you have access to the thing that handles upload (HTTP, FTP, a setuid script that copies files?) you can put triggers in there of course.

Resources