Best method for dealing with Unix compressed files (.Z) in IDL?

I'm working on some code in IDL that retrieves data files through FTP that are Unix compressed (.Z) files. I know IDL can read .gz compressed files with the /compress keyword; however, it doesn't seem able to handle .Z compression.
What are my options for working with these files? The files I am downloading come from another institution, so I have no control over the compression being used. Downloading and decompressing the files manually before running the code is an absolute last resort: it makes things a lot more difficult because I don't always know in advance which files I need from the FTP site, so the code grabs the ones needed based on the parameters in real time.
I'm currently running on Windows 7, but once the code is finished it will also be used on a Unix system (a computer cluster).

You can use SPAWN, as you note in your comment (assuming you can find an equivalent of the Unix uncompress command that runs on Windows), or, for higher speed, you can use an external C function with CALL_EXTERNAL to do the decompression. By coincidence, I posted an answer on Stack Exchange the other day with just such a C function to decompress .Z files here.
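For what it's worth, a gzip binary will also decompress LZW .Z files, so the external command can be as simple as gzip -d. A minimal sketch of that step (shown in Python here; from IDL the same command string would be passed to SPAWN, and the file name is a placeholder):

    # Minimal sketch (not from the original answer): decompress a .Z file by
    # shelling out to gzip, which also understands the LZW "compress" format.
    # Assumes a gzip (or uncompress) binary is on the PATH.
    import subprocess

    def uncompress_z(path):
        # "gzip -d" removes the .Z suffix and replaces the file in place.
        subprocess.run(["gzip", "-d", path], check=True)

    uncompress_z("data_file.dat.Z")  # hypothetical filename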

Related

Replacing static files that are under heavy read load

Let us assume we have a static file server (Nginx + Linux) that serves 10 files. The files are read almost as frequently as the server can process. However, some of the files need to be replaced with new versions while the filename and URL remain unaltered. How can I replace the files safely, without fear that some reads fail or return a mix of two versions?
I understand this is a rather basic operating system matter and has something to do with renames, symlinks, and file sizes. However, I failed to find a clear reference or a good discussion and I hope we can build one here.
Use rsync. Typically I choose rsync -av src dst, but YMMV.
What is terrific about rsync is that, in addition to having essentially zero cost when little or nothing has changed, it uses an atomic rename. During the transfer, a ".fooNNNNN" temp file grows; once the transfer completes, rsync closes the file and renames it on top of "foo". Web clients therefore see either all of the old file or all of the new one.
Note that range downloads (say, a restart after an error) are not atomic, so such clients can end up with a corrupted mix, especially if bytes were inserted near the beginning of the file. A SHA1 checksum would not validate for such a client, and they would have to restart the download from scratch. By the way, if these are large files, tell nginx to use zero-copy sendfile().
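The same write-to-temp-then-rename pattern is easy to reproduce by hand when rsync is not involved. A minimal sketch, assuming POSIX rename semantics (filenames are placeholders):

    # Minimal sketch of atomic replacement: write the new version to a temp
    # file in the same directory, then rename it over the old one. os.replace()
    # maps to rename(), which is atomic on POSIX, so readers see either the old
    # file or the new file, never a mix.
    import os, tempfile

    def atomic_publish(path, data: bytes):
        dirname = os.path.dirname(path) or "."
        fd, tmp = tempfile.mkstemp(dir=dirname, prefix=".foo")  # same filesystem
        try:
            with os.fdopen(fd, "wb") as f:
                f.write(data)
                f.flush()
                os.fsync(f.fileno())    # make sure the bytes are on disk
            os.replace(tmp, path)       # atomic swap; existing readers keep the old inode
        except BaseException:
            os.unlink(tmp)
            raise

    atomic_publish("/var/www/static/foo", b"new contents")  # hypothetical path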

MPI parallel write to a TIFF file

I'm trying to write a TIFF file in MPI code. Different processors have different parts of the image, and I want to write the image to the file in parallel.
The write fails; only the first processor can write to the file.
How do I do this?
There is no error from my implementation; it just does not work.
I used h = TIFFOpen(file, "a+") on each processor to open the same file (I am not sure whether this is the right approach). Each processor responsible for a directory then writes its header in its own place using TIFFSetDirectory(h, directorynumber), writes the contents of that directory, and finalizes with TIFFWriteDirectory(h). The result is that only the first directory is written to the file.
I thought I might need to open the file using MPI_IO, but then it is no longer a TIFFOpen handle?
Different MPI tasks are independent processes which, from the OS point of view, may even run on independent hosts. The TIFF library is not designed to handle parallel operations, so when every rank opens the same file, the first process succeeds and all the rest fail because they find the file already opened (on a shared filesystem).
Unless you are dealing with huge images (e.g. astronomical images) where parallel I/O matters for performance (and you need a filesystem that supports it; I am aware of IBM GPFS), I would avoid writing a custom TIFF driver with MPI_IO.
Instead, the typical solution is to gather (MPI_Gather()) the image parts on the process with rank == 0 and let only that process save the TIFF file.
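A minimal sketch of that gather-then-write pattern, here using mpi4py, NumPy, and Pillow rather than C and libtiff (the library choice is mine, and each rank is assumed to own one horizontal strip of the image):

    # Minimal sketch (assuming mpi4py, NumPy, and Pillow are available) of the
    # gather-then-write pattern: every rank sends its part of the image to
    # rank 0, which is the only process that touches the TIFF file.
    from mpi4py import MPI
    import numpy as np
    from PIL import Image

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    # Each rank computes its own horizontal strip (here just a dummy fill).
    strip = np.full((100, 512), rank * 40, dtype=np.uint8)

    strips = comm.gather(strip, root=0)   # list of arrays on rank 0, None elsewhere

    if rank == 0:
        image = np.vstack(strips)         # reassemble the full image
        Image.fromarray(image).save("output.tif")

Run it with something like mpiexec -n 4 python write_strips.py.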

Storing and serving many compressed archives with shared underlying content

I have a web server that has many compressed archive files (zip files) available for download. I would like to drastically reduce the disk footprint those archives take on the server.
The key insight is that those archives are in fact slightly different versions of the same uncompressed content. If you uncompressed any two of these many archives and ran a diff on the results, I expect you would find that the diff is about 1% of the total archive size.
Those archives are actually JAR files, but I believe the compression details are irrelevant. It does explain, though, why serving those archives in a specific compressed format is non-negotiable: it is the basic purpose of the server.
In itself, it is not a problem for me to install differential storage for the content of those archives, drastically reducing the disk footprint of the set of archives. There are numerous ways of doing this, using delta encoding or a compressed filesystem that understands sharing (e.g. I believe btrfs understands block sharing, or I could use snapshotting to enforce it).
The question is: how do I produce compressed zips from those files? The server I have has very little computational power, certainly not enough to recreate JARs on the fly from the block-sharing content.
Is there a programmatic way to expose the shared content at the uncompressed level to the compressed level? An easily-translatable-to-zip incremental compressed format?
Should I look for a caching solution coupled with generating JARs on the fly ? This would at least alleviate the computational pain from generating the JARs that are the most requested.
There is specialized hardware that can produce zips very fast, but I'd rather avoid the expense. It's also not a very scalable solution as the number of requests to the server grows.
If the 1% differences are smeared across all of the entries in all of the jar files, then there's not much you can do without having to recompress a lot.
If, on the other hand, the 1% differences are concentrated in a few percent of the jar entries, with most of the entries unchanged, then there's hope. You can keep each individual jar entry in its own jar file on the server and, for each jar file you want to serve, just keep a list of those individual entry files to combine. It would be easy to write a fast utility that takes a set of jar files and merges them into a single jar file, if one doesn't exist already.
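As an illustration only, that merge step is a few lines with Python's zipfile module (file names here are hypothetical):

    # Minimal sketch (not an existing tool) of merging several single-entry jar
    # files into one combined jar using the zipfile module. Note that this
    # recompresses each entry; a smarter tool could copy the already-compressed
    # data verbatim.
    import zipfile

    def merge_jars(entry_jars, output_jar):
        with zipfile.ZipFile(output_jar, "w", zipfile.ZIP_DEFLATED) as out:
            for part in entry_jars:
                with zipfile.ZipFile(part) as src:
                    for info in src.infolist():
                        out.writestr(info, src.read(info.filename))

    # Hypothetical usage: the list of per-entry jars that make up one release.
    merge_jars(["entries/Foo.class.jar", "entries/Bar.class.jar"], "release-1.2.jar")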
One approach I've used in the past is to log for some time the actual requests for the zip files. If you find that the requests are highly skewed, then you may be able to use caching to alleviate the cost of producing zip files on the fly.
Basically, implement your differential storage along the lines you suggest. Also allocate some amount, say 10%, of your total storage for an LRU cache (or whatever replacement algorithm you prefer) of the actual zipped files. Every time a user requests a zip, you serve it from the cache if it is ready, or generate it on the fly and put it in the cache if not.
In the general case this may not work well, but in the common case that actual requests are typically to a small concentrated number of files, it may solve the problem.
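A rough sketch of that cache-or-generate logic, with all names and sizes as placeholders and build_zip() standing in for whatever rebuilds an archive from the differential store:

    # Rough sketch of serving zips through a bounded cache: serve from the cache
    # when the archive is already there, otherwise generate it, store it, and
    # evict the least recently used entries once the cache exceeds its budget.
    import os, collections

    CACHE_DIR = "/var/cache/zips"          # hypothetical location
    CACHE_BUDGET = 10 * 1024**3            # e.g. 10% of total storage
    _lru = collections.OrderedDict()       # name -> size, oldest first

    def get_zip(name, build_zip):
        path = os.path.join(CACHE_DIR, name)
        if name in _lru:                   # cache hit: mark as recently used
            _lru.move_to_end(name)
            return path
        build_zip(name, path)              # cache miss: generate on the fly
        _lru[name] = os.path.getsize(path)
        while sum(_lru.values()) > CACHE_BUDGET:   # evict LRU entries
            old, size = _lru.popitem(last=False)
            os.unlink(os.path.join(CACHE_DIR, old))
        return path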
Otherwise, I see your options as:
Use delta encoding on disk and then change the format your clients expect for responses. For example, instead of zip, you can serve them a format which is basically the bits of the delta-encoded files they need to reconstruct the file. On the server side, you save most of the work since you are just serving files more or less unmodified from disk, and then the client has to put them together (the existing client already has to unzip the files, so perhaps this is not an undue burden).
Carefully look at the .zip format and store your files in a specialized way that does most of the .zip work ahead of time. For example, something like a delta encoding, but with the actual hard part of match-finding stored on disk, such that encoding a file can be a very fast process. This would require someone with sophisticated knowledge of the zip format to design, however.

Determine file compression type

I backed up a large number of files to S3 from a PC before switching to a Mac several months ago. Now, several months later, I'm trying to open the files and have realized they were all compressed by the S3 GUI tool I used, so I cannot open them.
I can't remember what program I used to upload the files, and the standard command-line decompression tools are not working, e.g.:
unzip
bunzip2
tar -zxvf
How can I determine what the compression type is of the file? Alternatively, what other decompression techniques can I try?
PS - I know the files are not corrupted because I tested downloading and opening them back when I originally uploaded to S3.
You can use Universal Extractor (open source) to determine compression types.
Here is a link: http://legroom.net/software/uniextract/
The small downside is that it looks at the file extension first, but I manage by changing the extension myself for an unknown file, and it works almost always, e.g. .rar or .exe, etc.
EDIT:
I found a huge list of archive programs; maybe one of them will work? It's ridiculously big:
http://www.maximumcompression.com/data/summary_mf.php
http://www.maximumcompression.com/index.html
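Aside from dedicated extractors, the standard Unix file command (file yourarchive, available on the Mac) usually identifies the format from its leading magic bytes, and the same check is easy to do by hand. A small sketch (the filename is a placeholder):

    # Small sketch: identify common compression formats from their leading
    # "magic" bytes instead of trusting the file extension.
    SIGNATURES = [
        (b"\x1f\x8b", "gzip (.gz)"),
        (b"\x1f\x9d", "compress (.Z)"),
        (b"BZh", "bzip2 (.bz2)"),
        (b"PK\x03\x04", "zip / jar"),
        (b"7z\xbc\xaf\x27\x1c", "7-Zip (.7z)"),
        (b"Rar!", "RAR"),
        (b"\xfd7zXZ\x00", "xz"),
    ]

    def sniff(path):
        with open(path, "rb") as f:
            head = f.read(8)
        for magic, name in SIGNATURES:
            if head.startswith(magic):
                return name
        return "unknown (first bytes: %s)" % head.hex()

    print(sniff("mystery_backup_file"))   # hypothetical filename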

How do I download efficiently with rsync?

A couple of questions related to one theme: downloading efficiently with Rsync.
Currently, I move files from an 'upload' folder onto a local server using rsync. Files to be moved are often dumped there, and I regularly run rsync so the files don't build up. I use '--remove-source-files' to remove files that have been transferred.
1) The '--delete' options that remove destination files come in several variants that let you choose when the files are removed. Something similar would be handy for '--remove-source-files', since it seems that, by default, rsync only removes the source files after all files have been transferred, rather than after each file. Other than writing a script to make rsync transfer the files one by one (see the sketch after this question), is there a better way to do this?
2) On the same theme, if a large (single) file is transferred, it can only be deleted after the whole thing has been successfully moved. It strikes me that I might be able to use 'split' to break the file into smaller chunks, so that each chunk can be deleted as the transfer progresses; is there a better way to do this?
Thanks.
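For question 1, a rough sketch of the one-by-one wrapper script mentioned there (paths are placeholders; only documented rsync flags are used):

    # Rough sketch of the "transfer files one by one" workaround: each file is
    # sent with its own rsync invocation, and --remove-source-files deletes that
    # source file as soon as its transfer has succeeded.
    import os, subprocess

    SRC = "/srv/upload"                    # hypothetical upload folder
    DST = "server:/data/incoming/"         # hypothetical destination

    for name in sorted(os.listdir(SRC)):
        path = os.path.join(SRC, name)
        if os.path.isfile(path):
            subprocess.run(
                ["rsync", "-av", "--remove-source-files", path, DST],
                check=True,                # stop if a transfer fails
            )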
