Strategy for handling user input as files - unix

I'm creating a script to process files provided to us by our users. Everything happens within the same UNIX system (running on Solaris 10).
Right now our design is this:
1. User places a file into the upload directory.
2. A script is placed on cron to run every 10 minutes.
3. The script looks for files in the upload directory, processes them, and deletes them immediately afterward.
For historical/legacy reasons, #1 can't change. Also, deleting the file after processing is a requirement.
My primary concern is concurrency. It is very likely that the situation will arise where the analysis script runs while an input file is still being written to. In this case, data will be lost, and this is (obviously) unacceptable.
Since we have no control over the user's chosen means of placing the input file, we cannot require them to obtain a file lock. As I understand it, file locks are advisory only on UNIX, so a user must choose to adhere to them.
I am looking for advice on best practices for handling this problem. Thanks

Obviously all the best solutions involve the client providing some kind of trigger indicating that it has finished uploading. That could be a second file, an atomic move of the file to a processing directory after writing it to a stage directory, or a REST web service. I will assume you have no control over your clients and are unable or unwilling to change anything about them.
In that case, you still have a few options:
You can use a pretty simple heuristic: check the file size, wait 5 seconds, then check the file size again. If it didn't change, it's probably good to go (a short sketch follows below).
If you have super-user privileges, you can use lsof to determine if anyone has this file open for writing.
If you have access to the thing that handles the upload (HTTP, FTP, a setuid script that copies files?), you can of course put triggers in there.
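For illustration only, here is a minimal Python sketch combining the size-stability heuristic with an lsof check; the upload path, the 5-second settle time, and the process callback are placeholders, and lsof must be available on the box:

    import os
    import subprocess
    import time

    UPLOAD_DIR = "/path/to/upload"   # placeholder for the real upload directory
    SETTLE_SECONDS = 5               # wait between the two size checks

    def size_is_stable(path, wait=SETTLE_SECONDS):
        """Heuristic: the file size did not change across the wait interval."""
        before = os.path.getsize(path)
        time.sleep(wait)
        return os.path.getsize(path) == before

    def is_open(path):
        """Best-effort lsof check: exit status 0 means some process still has the file open.
        May need elevated privileges to see other users' processes."""
        return subprocess.run(["lsof", path],
                              stdout=subprocess.DEVNULL,
                              stderr=subprocess.DEVNULL).returncode == 0

    def process_ready_files(process):
        for name in os.listdir(UPLOAD_DIR):
            path = os.path.join(UPLOAD_DIR, name)
            if not os.path.isfile(path):
                continue
            if is_open(path) or not size_is_stable(path):
                continue      # likely still being written; catch it on the next cron run
            process(path)
            os.unlink(path)   # deleting after processing is a stated requirement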

Related

Stop rsync from backing up if too many files are being changed

Does anyone know of a way that I can tell rsync not to perform a backup if it detects that more than X amount of data will be changed? For example, if I run a backup and it detects that 25% of the data in the destination directory will be changed, can I have it automatically abort that run so that I can evaluate it and decide whether to allow it or not?
I back up my machine every night, but what I'm worried about is my machine getting hit with a ransomware bug or another issue that causes a ton of my data to get destroyed or lost; I really don't want that to propagate to my backup. I used to use a tool called synconvery and it had this feature, but I don't think the tool is supported very well, and I get a lot of permission and read errors with it that I don't see with any other tools. Goodsync also has this feature, but even though it runs on the Mac it doesn't support special characters in a file name and replaces them with an underscore when the file is copied. I just think that will cause problems when I try to restore those files and one is referenced with that special character but can't be found because it has a damn underscore. I like using rsync, and I will eventually retrofit my script to use msrsync, but I can't trust it if I can't get this protection in place.
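Not an answer from this thread, but one way to approximate this protection is to wrap rsync: do a dry run first, count what would be transferred or deleted, and only run for real if the count stays under a threshold. The paths, the 25% limit, and the rsync options below are illustrative, and the change count is approximate (directories and attribute-only changes are counted too):

    import os
    import subprocess
    import sys

    SOURCE = "/Users/me/"            # placeholder source
    DEST = "/Volumes/Backup/me/"     # placeholder (local) destination
    MAX_CHANGED_FRACTION = 0.25      # skip the real run if more than 25% would change

    def count_changes():
        """Dry run: each --itemize-changes line is one item rsync would transfer or delete."""
        out = subprocess.run(
            ["rsync", "-a", "--delete", "--dry-run", "--itemize-changes", SOURCE, DEST],
            capture_output=True, text=True, check=True).stdout
        return sum(1 for line in out.splitlines() if line.strip())

    def count_dest_files():
        return sum(len(files) for _, _, files in os.walk(DEST))

    changed = count_changes()
    total = max(count_dest_files(), 1)
    if changed / total > MAX_CHANGED_FRACTION:
        sys.exit(f"{changed} of {total} files would change; skipping this run for manual review")

    subprocess.run(["rsync", "-a", "--delete", SOURCE, DEST], check=True)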

Is there a way to automatically make a copy of a file each time it is updated in Unix?

I have an application that updates some files on a Unix server. Since I cannot modify this application, is there any way I can make sure that these files are copied before each update, so I can have a history of the changes?
Is there a way/tool in Unix to do that?
If you are on Linux (specifically), you could use the inotify(7) facilities (perhaps via incrontab ...)
Alternatively, you might periodically run (through some crontab(5) entry) a script that does some make with your particular Makefile (since GNU make is designed to care about timestamps), managing e.g. backups. Or you could periodically run some rsync command.
However, it smells like you need some revision control (also known as a version control system). I strongly recommend git; you could use it before and after running your application (e.g. write some wrapping shell script doing that; a rough sketch of such a wrapper follows this answer).
But there is probably no universal solution (e.g. what if the monitored application keeps a file descriptor open for a long time and writes the file little by little...). You should explain in much more detail what is happening and what you want.
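As a rough sketch of that wrapping-script idea (the watched file list and application command are placeholders, and this assumes the files already live inside a git working tree):

    import subprocess

    WATCHED_FILES = ["data/settings.conf"]      # placeholder: files the application updates
    APP_COMMAND = ["/usr/local/bin/the-app"]    # placeholder: the application you cannot modify

    def commit(message):
        # Stage the watched files and commit only if something actually changed.
        subprocess.run(["git", "add", "--"] + WATCHED_FILES, check=True)
        if subprocess.run(["git", "diff", "--cached", "--quiet"]).returncode != 0:
            subprocess.run(["git", "commit", "-m", message], check=True)

    commit("before application run")
    subprocess.run(APP_COMMAND, check=True)
    commit("after application run")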

Is it better to execute a file over the network or copy it locally first?

My winforms app needs to run an executable that's sitting on a share. The exe is about 50MB (it's a setup.exe type of file). My app will run on many different machines/networks with varying speeds (some fast, but some awfully slow, like barely 10baseT speeds).
Is it better to execute the file straight from the share or is it more efficient to copy it locally and then execute it? I am talking in terms of annoying the user the least.
Locally is better. A copy will read each byte of the file a single time, no more, no less. As you execute, you may revisit code that has fallen out of cache, etc., and it gets pulled over the network again.
Since it's a setup program, I would assume the engine will want to do some kind of CRC or other integrity check too, which means it's reading the entire file anyway.
It is always better to execute it locally than to run it over the network.
If your application is small and does not need to load many different resources at runtime, then it is OK to run it over the network. It might even be preferable, because running it over the network reads the code (downloads it and loads it into memory) once, as opposed to manually downloading the file and then running it, which reads the code twice. For example, you can run a clock widget application over the network.
On the other hand, if your application does read a lot of resources at runtime, then it is absolutely a bad idea to run it over the network, because each read of a resource will go over the network, which is very slow. For example, you probably don't want to be running Eclipse over the network.
Another factor to take into consideration is how many concurrent users will be accessing the application at the same time. If there are many, you should copy the application locally and run it from there.
I believe the OS always copies the file to a local temp folder before it is actually executed. There are no round trips to/from the network after it gets a copy; it only happens once. This is sort of like how a browser works... it first retrieves the file, saves it locally, then runs it off of the local temp location where it saved it. In other words, there is no need to copy it manually unless you want to keep a copy for yourself.
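For what it's worth, the copy-locally-then-run sequence is simple to script; here is a sketch in Python just to show the pattern (the share path is a placeholder):

    import os
    import shutil
    import subprocess
    import tempfile

    SHARE_PATH = r"\\server\share\setup.exe"   # placeholder UNC path to the installer

    def run_from_local_copy(share_path=SHARE_PATH):
        # One sequential read over the network; every later read is local.
        local_dir = tempfile.mkdtemp(prefix="installer_")
        local_path = os.path.join(local_dir, os.path.basename(share_path))
        shutil.copy2(share_path, local_path)
        try:
            subprocess.run([local_path], check=True)
        finally:
            shutil.rmtree(local_dir, ignore_errors=True)

    run_from_local_copy()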

concurrent reading and writing image files (asp.net, but applies to most web languages)

I have a .jpg file which represents the current image from a webcam. Users will be downloading this file at an interval of once a second. Because there could be dozens of users reading it, this could be dozens of times a second (which is normal for any web server).
The problem is, this image is updated by a 3rd-party application, also once a second, which "spiders" my local network's webcam portal image. This is so we can build our webcams into our current administration panel.
The problem I am already finding is that ASP.net sometimes gets an error that it cannot access the file because it is open for writing by the bot. Likewise, the bot cannot access it because IIS is feeding it to the user.
The bot uses io.streamwriter to save the data to the file, and my script uses Response.WriteFile to send the file to the user. (I need to use an actual ASP.net page with a JPG content-type that feeds the file, to make sure only users with an active session can view the JPG.)
My question is: what are the best practices for this? I know why it's happening, but what is the best resolution? Would storing it as a BLOB in a database maybe be smarter, since databases are built for concurrent reading/writing already? Is there an easier way of doing this with a file that I have not thought of yet?
Thanks in advance,
Anthony Greco
Using a BLOB will work if the readers use SNAPSHOT isolation model (SQL Server 2005 and up). See Download and Upload images from SQL Server via ASP.Net MVC for how to stream an image from a BLOB, and see Understanding Row Versioning-Based Isolation Levels for a lecture on SNAPSHOT.
But using a BLOB may be overkill; you could get away with something much simpler. For instance, if you only have one ASP.Net process, then you could have a global volatile variable for the current file name. The writer writes the JPG into a new file, and then updates the global 'current' file name with an Interlocked.CompareExchange operation (it has to be a Compare because a newer writer might actually finish faster and outrun a previous writer, and you want to preserve the latest update). There are still some issues left to solve (finding out the file name at startup, cleaning up old files, etc.), but they are all fairly easy to solve.
If you have a farm of servers, or multiple ASP.Net processes serving the site, then things could get complicated. I would still do a rotating file name and a trial-and-error approach (try to respond with the newest file, fall back to the previous older one if a conflict is detected).
You could get the bot to write the data to a different filename and then do a delete and rename to the filename being served by ASP.Net. This should reduce the file lock time down to the time it takes for a delete and rename to occur. To clarify (a small sketch follows the steps):
ASP.Net serving image from "webcam.jpg"
bot writes image data to "temp.jpg"
when last image byte written, bot deletes "webcam.jpg" and renames "temp.jpg" to "webcam.jpg"
ASP.Net should check that "webcam.jpg" exists; if not, wait 10ms (or a suitably small increment) and check again.
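Here is the same temp-write-then-swap idea sketched in Python for illustration (the filename comes from the steps above; os.replace stands in for the delete-and-rename and overwrites the destination in one step, atomically on POSIX):

    import os
    import tempfile

    SERVED = "webcam.jpg"   # the filename ASP.Net serves (placeholder path)

    def publish_frame(jpeg_bytes, served=SERVED):
        """Write the new frame to a temp file in the same directory, then swap it into place."""
        directory = os.path.dirname(os.path.abspath(served))
        fd, temp_path = tempfile.mkstemp(dir=directory, suffix=".jpg")
        try:
            with os.fdopen(fd, "wb") as f:
                f.write(jpeg_bytes)
            # Readers see either the old or the new complete file, never a half-written one.
            os.replace(temp_path, served)
        except Exception:
            os.unlink(temp_path)
            raise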

How to let humans and programs access the same file without stepping on each other's toes

Suppose I have a file, urls.txt, that contains a list of URLs I'm monitoring. My monitoring script edits that file occasionally, say, to indicate whether each URL is reachable. I'd like to also manually edit that file, to add to or change the list of URLs. How can I allow that such that I don't have to think about it when manually editing?
Here are some possible answers. What would you do?
Engage in hackery like having the program check for the lockfiles that vim or emacs create. Since this is just for me, this would actually work.
If the human edits always take precedence, just always have the human clobber the program's changes (eg, ignore the editor's warning that the file has changed on disk). The program can then just redo its changes on its next loop. Still, changing the file while the user edits it is not so nice.
Never let a human touch a file that a program makes ongoing modifications to. Rethink the design and have one file that only the human edits and another file that only the program edits.
Give the human a custom tool to edit the file that does the appropriate file locking. That could be as crude as locking the file and then launching an editor (a minimal sketch of this follows the list), or a custom interface (perhaps a simple command line interface) for inserting/changing/deleting entries from the file.
Use a database instead of a flat file and then the locking is all taken care of automatically.
(Note that I concocted the URL monitoring example to make this more concrete and because what I actually have in mind is perhaps too weird and distracting -- this question is strictly about how to let humans and programs both modify the same state file.)
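For option 4, here is a minimal sketch of the "lock, then launch an editor" wrapper, assuming the monitoring script takes the same advisory flock before rewriting the file (the default filename is just the example's, and flock is advisory, so only cooperating processes honour it):

    import fcntl
    import os
    import subprocess
    import sys

    path = sys.argv[1] if len(sys.argv) > 1 else "urls.txt"   # the file from the example

    with open(path, "r+") as f:
        # Advisory lock: blocks until the monitoring script releases its own flock.
        fcntl.flock(f, fcntl.LOCK_EX)
        subprocess.run([os.environ.get("EDITOR", "vi"), path], check=True)
        # The lock is released when the file object is closed.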
I'd use a database since that's basically what you're going to have to build to achieve what you want. Why re-invent the wheel?
If a full-blown DBMS is too much of a load, split the file into two and synchronize them periodically. Whether the URL is reachable doesn't sound like something the user would be changing, so it should not be editable by them.
During the synchronize process (which would have to lock out both the monitor and the user, although it could be a sub-function of the monitor), remove entries in the monitor file that aren't in the user file. Also, add to the monitor file those that have been added to the user file (and start monitoring them).
But I'd go with the database method with a special front-end for the user, since you can get relatively good lightweight databases nowadays.
Use a sensible version control system!
(Git would work well here).
That said, the nature of the problem implies that a real database would be best - and they will generally have either database-level, table-level, or row-level locking - but then put any scripts you need into version control.
I would go with option 3. In fact, I would have the program read the human-edited input file, and append the results of each query to a log file. In this way, you can also analyse the reachability of sites over time. You can also have the program maintain a file that indicates the current reachability state of each site in the input file, as a snapshot of the current state.
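A minimal sketch of that arrangement, with invented file names: the program treats the human-edited list as read-only, appends history to its own log, and rewrites a snapshot file that only it owns:

    import datetime
    import urllib.request

    INPUT = "urls.txt"              # human-edited; the program never writes it
    LOG = "reachability.log"        # program-owned, append-only history
    SNAPSHOT = "current_state.txt"  # program-owned, latest status per URL

    def is_reachable(url, timeout=5):
        try:
            with urllib.request.urlopen(url, timeout=timeout):
                return True
        except Exception:
            return False

    with open(INPUT) as f:
        urls = [line.strip() for line in f if line.strip()]

    now = datetime.datetime.now().isoformat(timespec="seconds")
    results = [(url, is_reachable(url)) for url in urls]

    with open(LOG, "a") as log:
        for url, ok in results:
            log.write(f"{now}\t{url}\t{'up' if ok else 'down'}\n")

    with open(SNAPSHOT, "w") as snap:
        for url, ok in results:
            snap.write(f"{url}\t{'up' if ok else 'down'}\n")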
One other option is using two files, one for automated access and one for manual. You'd need a way in the user file to indicate modifications or deletions but you'd have similar problems in some of the other solutions as well.
