Is there a way to automatically make a copy of a file each time it is updated in Unix?

I have an application that updates some files on a Unix server. Since I cannot modify this application, is there any way I can make sure that these files are copied before each update, so I can have a history of the changes?
Is there a way or tool in Unix that lets me do that?

If you are on Linux (specifically), you could use the inotify(7) facilities (perhaps via incrontab ...)
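For instance, a minimal sketch using inotifywait from the inotify-tools package (the watched directory and the backup naming scheme are assumptions; adjust for your setup):
#!/bin/sh
# Snapshot every file in the watched directory as soon as a writer closes it.
WATCH_DIR=/path/to/watched/dir      # assumed location of the updated files
BACKUP_DIR=/path/to/backups         # assumed place to keep the history
mkdir -p "$BACKUP_DIR"
inotifywait -m -e close_write --format '%w%f' "$WATCH_DIR" |
while read -r file; do
    cp -p "$file" "$BACKUP_DIR/$(basename "$file").$(date +%Y%m%d%H%M%S)"
done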
Alternatively, you might periodically run (through some crontab(5) entry) a script that does some make with your particular Makefile (since GNU make is designed to care about timestamps), managing e.g. backups. Or you could periodically run some rsync command.
However, it smells like you need some revision control (also known as a version control system). I strongly recommend git; you could use it before and after running your application (e.g. write some wrapping shell script that does that).
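As a sketch of such a wrapper, assuming the directory the application updates is already a git repository (the paths and the application name are placeholders):
#!/bin/sh
# Record the state of the data directory before and after the application runs.
DATA_DIR=/path/to/data              # assumed directory that the application updates
cd "$DATA_DIR" || exit 1
git add -A && git commit -q -m "before run $(date +%Y%m%d-%H%M%S)" || true
/path/to/the-application "$@"       # placeholder for the real application
git add -A && git commit -q -m "after run $(date +%Y%m%d-%H%M%S)" || true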
But there is probably no universal solution (e.g. what if the monitored application keeps a file descriptor open for a long time and writes the file little by little...). You should explain in much more detail what is happening and what you want ...

How do I scp a file to a Unix host so that a file polling service won't see it before the copy is complete?

I am trying to transfer a file to a remote Unix server using scp. On that server, there is a service which polls the target directory to detect incoming files for processing. I would like to ensure that the polling service does not pick up new files before the copy is complete. Is there a way of doing that?
My file transfer process is a simple scp command embedded in a larger Java program. Ideally, a solution which did not involve changing the Java code would be best (for reasons involving change control processes).
You can scp the file to a different directory (e.g. /tmp) and move the file via ssh after the transfer is complete. The different directory needs to be on the same filesystem (partition) as the final destination directory; otherwise there will be a copy operation and you'll face a similar problem. Another service on the destination machine can do this move operation.
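A minimal sketch of that approach (host and paths are placeholders; it assumes the staging directory and the target directory live on the same filesystem, so the mv is an atomic rename):
scp /local/path/datafile user@remotehost:/target/staging/datafile.part
ssh user@remotehost 'mv /target/staging/datafile.part /target/path/datafile'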
You can transfer the file under a hidden name (prefix the filename with .), then rename it to its final name once the copy is complete.
If you can modify the polling service, you can check active scp processes and ignore files matching scp arguments.
You can check for open files with lsof +d $directory and ignore them in the polling service.
I suggest copying the file using rsync instead of scp. rsync already copies new files to temporary filenames, and has many other useful features for file synchronization as well.
$ rsync -a source/path/ remotehost:/target/path/
Of course, you can also copy file-by-file if that's your preference.
If rsync's temporary filenames are sufficient to avoid being picked up by your polling service, then you could simply replace your scp command with a shell script that acts as a wrapper for rsync, eliminating the need to change your Java program.
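As a rough sketch, such a wrapper could be as small as this (it assumes the Java program invokes a plain scp source target with no extra options; rsync's flags are not a drop-in match for scp's, so check what is actually passed, as noted below):
#!/bin/sh
# Hypothetical wrapper invoked in place of scp by the Java program.
# rsync writes to a temporary dot-file and renames it when the transfer finishes.
exec rsync -a "$@"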
You would need to know the precise format that your Java program uses to call the scp command, to make sure that the options you feed to rsync do what you expect.
You would also need to figure out how your Java program calls scp. If it does so by full pathname (i.e. /usr/bin/scp), then this solution might put other things at risk on your system that depend on scp (like you, for example, expecting scp to behave as it usually does instead of as a wrapper). Changing a package-installed binary like /usr/bin/scp may also "break" your package registration, making it difficult to install future security updates because a binary has changed to a shell script. And of course, there might be security implications to any change you make.
All in all, I suspect you're better off changing your Java program to make it do precisely what you want, even if that is to launch a shell script to handle aspects of automation that you want to be able to change in the future without modifying your Java.
Good luck!

Create own unix commands

Is it possible to create our own unix commands?
For example: we have ls -ltr, cd, mkdir, etc., which perform certain actions. I want to create a similar command which would save a username and password into a table in a database. I'm kinda new to Unix. Any suggestions?
Yes, it is easy to create your own commands that do jobs that you find useful. You can implement them in a variety of languages, from shell to Perl to C and on and on.
The only significance of the standard commands is that they are installed (usually) in /bin or /usr/bin rather than anywhere else, and that they do jobs defined by a standard (often POSIX). Often, people place locally created commands in /usr/local/bin; others create themselves a directory $HOME/bin and put their personal commands there. You simply need to ensure that these directories are on your PATH.
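For instance, here is a minimal sketch of a personal command for the use case in the question (the sqlite3 client, the database file, and the credentials table are all assumptions, and a real version should not take a password on the command line or splice it into SQL unescaped):
mkdir -p "$HOME/bin"
cat > "$HOME/bin/savecred" <<'EOF'
#!/bin/sh
# Usage: savecred USERNAME PASSWORD
# Inserts the pair into a hypothetical "credentials" table in an SQLite database;
# the table is assumed to exist already.
sqlite3 "$HOME/credentials.db" "INSERT INTO credentials (username, password) VALUES ('$1', '$2');"
EOF
chmod +x "$HOME/bin/savecred"
export PATH="$HOME/bin:$PATH"    # add this line to ~/.profile to make it permanent
savecred alice s3cret            # it now runs like any other command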
In my $HOME/bin directory (depending on which machine I'm looking at), I have from 46 commands (on this machine) up to about 500 on my main work machines. The commands do different jobs; the names are mnemonic to me (and generally not to other people). Some commands are polished and ready for production use anywhere (and these have manual pages, almost by definition of being production-ready). Others are quick hacks assembled for a quick-and-dirty job. Some of the quick hacks are removed; some get polished; some get stashed away in case I need to do something similar in the future. Only the trivial don't go under version control.
On this machine (which I only use casually and not really for development work), I have 9 shell scripts, 4 Perl scripts, and the rest are executables (Git and Go, mainly). On my main machines, I have many more shell and Perl scripts and proportionately fewer C programs. I have few Python scripts since I learned Perl first and I'm not as fluent in Python. I've been writing and collecting these scripts for a long time; the oldest versions of the oldest programs date back to about 1987.

Is it better to execute a file over the network or copy it locally first?

My winforms app needs to run an executable that's sitting on a share. The exe is about 50MB (it's a setup.exe type of file). My app will run on many different machines/networks with varying speeds (some fast, but some awfully slow, like barely 10baseT speeds).
Is it better to execute the file straight from the share or is it more efficient to copy it locally and then execute it? I am talking in terms of annoying the user the least.
Locally is better. A copy will read each byte of the file a single time, no more, no less. As you execute from the share, you may revisit code that has fallen out of cache and has to be pulled over the network again.
As a setup program, I would assume that the engine will want to do some kind of CRC or other integrity check too, which means it's reading the entire file anyway.
It is always better to execute it locally than to run it over the network.
If your application is small and does not need to load many different resources at runtime, then it is OK to run it over the network. It might even be preferable, because running it over the network reads the code only once (it is downloaded and loaded into memory), as opposed to manually downloading the file and then running it, which reads it twice. For example, you can run a clock widget application over the network.
On the other hand, if your application does read a lot of resources at runtime, then it is absolutely a bad idea to run it over the network, because each read of a resource will go over the network, which is very slow. For example, you probably don't want to be running Eclipse over the network.
Another factor to take into consideration is how many concurrent users will be accessing the application at the same time. If there are many, you should copy the application locally and run it from there.
I believe the OS always copies the file to a local temp folder before it is actually executed. There are no round trips to the network after it gets a copy; it only happens once. This is sort of like how a browser works: it first retrieves the file, saves it locally, then runs it off of the local temp location where it saved it. In other words, there is no need to copy it manually unless you want to keep a copy for yourself.

monitoring for changes in file(s) in real time

I have a program that monitors certain files for changes. As soon as a file gets updated, it is processed. So far I've come up with this general approach of handling "real time analysis" in R. I was hoping you guys have other approaches. Maybe we can discuss their advantages/disadvantages.
watched.file <- "/path/to/file"  # file to monitor (placeholder)
sleep.time <- 5                  # seconds between checks
monitor <- TRUE
start.state <- file.info(watched.file)$mtime # modification time of the file when initiating
while (monitor) {
  change.state <- file.info(watched.file)$mtime
  if (start.state < change.state) {
    # process the updated file
    start.state <- change.state  # reset the baseline so the file is not reprocessed
  } else {
    print("Nothing new.")
  }
  Sys.sleep(sleep.time)
}
Similar to the suggestion to use a system API, this can also be done using qtbase, which will be a cross-platform means from within R:
dir_to_watch <- "/tmp"
library(qtbase)
fsw <- Qt$QFileSystemWatcher()
fsw$addPath(dir_to_watch)
id <- qconnect(fsw, "directoryChanged", function(path) {
  message(sprintf("directory %s has changed", path))
})
cat("abc", file="/tmp/deleteme.txt")  # touching a file in the watched directory fires the callback
If your system provides an API for monitoring filesystem changes, then you should use that. I believe Macs come with this. Not sure about other platforms though.
Edit:
A quick Google search gave me:
Linux - http://wiki.linuxquestions.org/wiki/FAM
Win32 - http://msdn.microsoft.com/en-us/library/aa364417(VS.85).aspx
Obviously, these APIs will eliminate any polling that you require. On the other hand, they may not always be available.
Java has this: http://jnotify.sourceforge.net/ and http://java.sun.com/developer/technicalArticles/javase/nio/#6
I have a hack in mind: you can set up a cron job/scheduled task to run an R script every n seconds (or whatever). The R script checks the file hash, and if the hashes don't match, it runs the analysis. You can use the digest::digest function; just check out the manual.
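A shell variant of the same idea, using md5sum in place of digest::digest (it assumes GNU md5sum is available; the paths and the analysis script are placeholders), run by cron at whatever interval you choose:
#!/bin/sh
# Re-run the analysis only when the watched file's checksum has changed.
FILE=/path/to/watched/file       # placeholder
STAMP=/tmp/watched-file.md5      # where the last-seen hash is remembered
NEW=$(md5sum "$FILE" | cut -d' ' -f1)
OLD=$(cat "$STAMP" 2>/dev/null)
if [ "$NEW" != "$OLD" ]; then
    echo "$NEW" > "$STAMP"
    Rscript /path/to/analysis.R "$FILE"   # placeholder analysis script
fi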
If you have lots of files that you want to monitor, then R may be too slow for this purpose. Go to your c: or / dir and see how long it takes to do file.info(dir(recursive = TRUE)). A dos or bash script may be quicker.
Otherwise, the code looks fine.
You could use the tclTaskSchedule function in the tcltk2 package to set up a function that checks for updates and runs your code. This would then be run on a regular basis (you set the timing) but would still allow you to use your R session.
I'll offer another solution for Windows that I have been using in a production environment, that works perfectly, and that I find very easy to set up. Under the hood it basically accesses the system API for monitoring folder changes, as others have mentioned, but all the "hard work" is taken care of for you. I use a freely available piece of software called Folder Monitor by Nodesoft, which is well described here. Once you execute this program, it appears in your system tray, and from there you can specify a given directory to monitor. When files are written to the directory (or changed or modified; there are a few options from which you can choose), the program executes any program you like. I simply link the program to a Windows batch file that calls my R script.
So, for example, I have Folder Monitor set up to monitor a "\\myservername\DropOff" UNC path for any new data files written to it. When Folder Monitor detects new files, it executes a RunBatch.bat file that simply runs an R script (see here for information on setting that up) that validates the format of the expected file based on an expected naming convention, then unzips and processes the data, creating a data frame and ultimately loading it into a SQL Server database. It just doesn't get any easier.
One note if you decide to use this solution: take a look at the optional delay execution parameter, which might be important if files take a while to copy into the target directory from the source location.

Strategy for handling user input as files

I'm creating a script to process files provided to us by our users. Everything happens within the same UNIX system (running on Solaris 10).
Right now our design is this:
User places file into upload directory
Script placed on cron to run every 10 minutes.
Script looks for files in the upload directory, processes them, and deletes them immediately afterward
For historical/legacy reasons, #1 can't change. Also, deleting the file after processing is a requirement.
My primary concern is concurrency. It is very likely that the situation will arise where the analysis script runs while an input file is still being written to. In this case, data will be lost, and this is (obviously) unacceptable.
Since we have no control over the user's chosen means of placing the input file, we cannot require them to obtain a file lock. As I understand, file locks are advisory only on UNIX. Therefore a user must choose to adhere to them.
I am looking for advice on best practices for handling this problem. Thanks
Obviously all the best solutions involve the client providing some kind of trigger indicating that it has finished uploading. That could be a second file, an atomic move of the file to a processing directory after writing it to a stage directory, or a REST web service. I will assume you have no control over your clients and are unable or unwilling to change anything about them.
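In case changing the clients ever becomes possible, the second-file (trigger) variant could look like this on the client side (host and paths are placeholders), with the processing script picking up only data files that have a matching .done marker:
# Upload the data file, then create a small marker file once the upload has finished.
scp /local/path/datafile user@server:/incoming/upload/datafile
ssh user@server 'touch /incoming/upload/datafile.done'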
If you cannot change your clients, you still have a few options:
You can use a pretty simple heuristic: check the file size, wait 5 seconds, check the file size again. If it didn't change, it's probably good to go (see the sketch after this list).
If you have super-user privileges, you can use lsof to determine if anyone has this file open for writing.
If you have access to the thing that handles upload (HTTP, FTP, a setuid script that copies files?) you can put triggers in there of course.
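A minimal sketch of the size-stability heuristic from the first option above, as it might run from the existing cron job (the directory and the process_file handler are placeholders; backticks are used so it also runs under Solaris 10's old /bin/sh):
#!/bin/sh
# Process only files whose size has stayed the same for 5 seconds.
UPLOAD_DIR=/path/to/upload        # placeholder
for f in "$UPLOAD_DIR"/*; do
    [ -f "$f" ] || continue
    size1=`wc -c < "$f" | awk '{print $1}'`
    sleep 5
    size2=`wc -c < "$f" | awk '{print $1}'`
    if [ "$size1" -eq "$size2" ]; then
        process_file "$f" && rm -f "$f"   # process_file is a hypothetical handler
    fi
done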
