Non-proprietary directory encryption

We store measurement results in directories. Each directory contains a meta.xml describing common properties of the result, plus several data files. This result has to be encrypted.
My dream solution would look like this:
We can use ZIP, TAR, or a similar format for packing the directory into a file
[optional] We can extend the archive header with our own MIME type (MIME recognition without file extensions)
We can use the encryption algorithm defined in the archive standard (e.g. ZIP) to encrypt/decrypt our result
We can extract single files from the archive without decrypting the whole archive (there are 100 MB files, but most of the time I'm only interested in the meta.xml)
We can use regular tools (7Zip, WinZip, zip on Unix) to access the encrypted file
[optional] We can use more than one key to encrypt our result file
Is this solution realizable? Are there open-source libraries that do the job? Which encryption algorithm should we use?
Best regards!

The use of AES encryption in zip files is supported by PKZip, WinZip, and 7-Zip; it is specified in the PKWARE zip appnote and well described here: Encryption Specification AE-1 and AE-2. Unfortunately, neither Info-ZIP zip nor unzip (which is what you find on Unix-like systems) currently supports it. 7-Zip is open source. Note that the original zip "encryption" hardly deserves the name and should be avoided at all costs. The standardized AES encryption is strong, usable, and relatively widely supported.
Update:
I just noticed another part of your question. Each zip entry can be encrypted separately with a different password, and in fact you can mix encrypted and unencrypted entries in the same zip file.
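For illustration, here is roughly how that could look with the 7-Zip command line (a sketch; the archive name, password, and directory layout are made up, and switch spellings may vary between 7-Zip versions):

    7z a -tzip -mem=AES256 -pS3cret result.zip result/   # pack the directory as a zip with AES-256 entry encryption
    7z l result.zip                                       # list entries; zip does not encrypt file names
    7z e -pS3cret result.zip meta.xml                     # extract only meta.xml

Because zip compresses and encrypts each entry independently and keeps a central directory at the end of the archive, pulling out meta.xml does not require reading or decrypting the whole 100 MB file.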

Related

Should WebDAV servers use text mode to open and store text files?

I am wondering whether a WebDAV server should store uploaded files in text mode when the MIME type is 'text/...'.
Unix, Windows and Mac OS use different line endings.
Opening a file in write+text mode may convert carriage returns / newlines according to the server's system convention (which may be different from the WebDAV client's).
The obvious alternative would be to store all incoming files as binary blobs, without any conversion.
I see these pros for text mode:
The text files can be opened on the server using text editors
The uploaded text files may also be easier to interpret by server software (e.g. XML parsers, script processors)
All clients get all text files with the same line ending convention (as defined by the server platform)
I think I have seen implementations doing this
and these cons of text mode:
A client cannot expect to GET the same file content that was POSTed.
If a Windows WebDAV client stores a file to a Unix server, the file sizes are different.
I.e. the resource's 'size' property is greater than the length of the data returned on GET.
Dangerous: if a file claims MIME type 'text/foo' but is in fact binary (e.g. a zipped XML file), converting \n/\r bytes will corrupt the file.
Text mode may be slower, since extra processing is required(?)
Am I missing something?
How do common WebDAV servers handle this? Is there a best practice?
I don't know how common WebDAV servers handle this, but I think it's a bad idea.
Just the risk of damaging binary files that are thought to be text, as you mentioned, makes it not worth doing. Here are some more disadvantages:
Breaks the model of the WebDAV server as a simple storage and retrieval device. As a user:
I'd view this as mangling my files.
I'd have to spend time figuring out how and why my files had changed.
Then I'd wonder what other things the server might be doing to my files on my behalf.
Changes line endings based on the server's OS, not the client's (customer's).
If I'm a Windows-only or Unix-only user, then all my line endings are right for me, and I don't want the server changing them. If I use both, then I already have tools that are either insensitive to the line endings or can convert between them.
My experience with text-processing client programs in recent years is that they're all insensitive to line endings. XML parsers and script interpreters, for example, can work with either style of line ending. So I don't see much benefit to offset the risk.
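To make the corruption risk concrete, here is a small sketch (assuming a Unix shell with gzip and perl; the file names are made up). A gzipped XML file is binary data even if someone labels it with a text MIME type, and a naive LF-to-CRLF conversion breaks it:

    gzip -c meta.xml > meta.xml.gz                     # the payload is XML, but the .gz file is binary
    perl -pe 's/\n/\r\n/' meta.xml.gz > mangled.gz     # naive Unix-to-Windows "text mode" conversion
    gzip -t mangled.gz                                 # almost certainly fails with a format/CRC error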

Storing and serving many compressed archives with shared underlying content

I have a web server that has many compressed archive files (zip files) available for download. I would like to drastically reduce the disk footprint those archives take on the server.
The key insight is that those archives are in fact slightly different versions of the same uncompressed content. If you uncompressed any two of these many archives and ran a diff on the results, I expect you would find that the diff is about 1% of the total archive size.
Those archives are actually JAR files, but the compression details are, I believe, irrelevant. It does explain why serving those archives in a specific compressed format is non-negotiable: it is the basic purpose of the server.
In itself, it is not a problem for me to set up differential storage for the content of those archives, drastically reducing the disk footprint of the set of archives. There are numerous ways of doing this, using delta encoding or a compressed filesystem that understands sharing (e.g. I believe btrfs understands block sharing, or I could use snapshotting to enforce it).
The question is: how do I produce compressed zips from those files? The server I have has very little computational power, certainly not enough to recreate JARs on the fly from the block-sharing content.
Is there a programmatic way to expose the shared content at the uncompressed level to the compressed level? An easily-translatable-to-zip incremental compressed format?
Should I look for a caching solution coupled with generating JARs on the fly? This would at least alleviate the computational cost of generating the most frequently requested JARs.
There is specialized hardware that can produce zips very fast, but I'd rather avoid the expense. It's also not a very scalable solution as the number of requests to the server grows.
If the 1% differences are smeared across all of the entries in all of the jar files, then there's not much you can do without having to recompress a lot.
If, on the other hand, the 1% differences are concentrated in a few percent of the jar entries, with most of the jar entries unchanged, then there's hope. You can keep each individual jar entry in its own jar file on the server, and for each jar file you want to serve, just keep a list of those individual jar-entry files to combine. It would be easy to write a fast utility that takes a set of jar files and merges them into a single jar file, if there isn't one already.
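As a sketch of that merge step, suppose each entry lives in its own small jar under parts/ and app-1.2.list names the parts making up one downloadable jar (these names are made up). zipmerge, which ships with libzip, can combine zip archives; failing that, unzip-then-zip works but pays the recompression cost you are trying to avoid:

    # merge with zipmerge (copies entries between archives)
    zipmerge app-1.2.jar $(sed 's|^|parts/|' app-1.2.list)

    # fallback: unpack and repack (simple, but recompresses everything)
    mkdir -p build && (cd build && while read p; do unzip -oq "../parts/$p"; done < ../app-1.2.list && zip -rq ../app-1.2.jar .)

Either way the result is a valid jar, though not necessarily byte-identical to the original archive, which may or may not matter to your clients.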
One approach I've used in the past is to log for some time the actual requests for the zip files. If you find that the requests are highly skewed, then you may be able to use caching to alleviate the cost of producing zip files on the fly.
Basically, implement your differential storage along the lines you suggest. Also allocate some amount, say 10%, of your total storage for an LRU cache (or whatever other replacement policy you prefer) of the actual zipped files. Every time a user requests a zip, you serve it from the cache if it is ready, or generate it on the fly and put it in the cache if not.
In the general case this may not work well, but in the common case where requests concentrate on a small number of files, it may solve the problem.
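A minimal sketch of that cache-or-generate step in shell (build_zip is a hypothetical placeholder for whatever recreates a jar from the differential store; the paths and eviction rule are made up):

    CACHE=/var/cache/jars
    serve_jar() {                                        # print the requested jar on stdout
        name="$1"
        [ -f "$CACHE/$name" ] || build_zip "$name" "$CACHE/$name"   # cache miss: generate on the fly
        touch "$CACHE/$name"                             # refresh mtime so eviction approximates LRU
        cat "$CACHE/$name"
    }
    # eviction, e.g. from cron; a size-based cap needs a bit more scripting
    find "$CACHE" -type f -mtime +7 -delete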
Otherwise, I see your options as:
Use delta encoding on disk and then change the format your clients expect for responses. For example, instead of zip, you can serve them a format which is basically the bits of the delta-encoded files they need to reconstruct the file. On the server side, you save most of the work since you are just serving files more or less unmodified from disk, and then the client has to put them together (the existing client already has to unzip the files, so perhaps this is not an undue burden).
Carefully look at the .zip format and store your files in a specialized way that does most of the .zip work ahead of time. For example, something like a delta encoding, but with the actual hard part of match-finding stored on disk, such that encoding a file can be a very fast process. This would require someone with sophisticated knowledge of the zip format to design, however.

Encrypting files added to Mercurial repositories on commit

Having read this past question for git, I would like to ask whether something like that exists, but which:
can be done programmatically (file list) on each machine;
works for Mercurial.
The reason for this is that I would like to include in my public dotfiles repository some configuration files that store passwords in plaintext. I know I could write a wrapper script around hg(1), but I would like to know if there are alternative approaches, just for the sake of curiosity.
Thank you.
You could use a pair of pre-commit and post-update hooks to encrypt/decrypt as necessary. See http://hgbook.red-bean.com/read/handling-repository-events-with-hooks.html for more details.
However, it's worth pointing out that if you're storing encrypted text in your repo you'll be unable to create meaningful diffs -- essentially everything will be like a binary file but also poorly compressible.
Mercurial has a filter system that lets you mangle files when they are read from the repository or written back. If you have a program like the SSH agent running that lets you do non-interactive encryption and decryption, then this might just be workable.
As Ryan points out, this will necessarily lead to a bigger repository since each encrypted version of your files will look completely different from the previous version. Mercurial detects this and stores the versions uncompressed (encrypted files cannot be compressed anyway). Since you will use this for dotfiles, you can ignore the space overhead, but it's something to take into consideration if you will be versioning bigger files in encrypted form.
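For the filter approach, the wiring lives in the repository's .hg/hgrc (or your ~/.hgrc). A sketch, assuming GnuPG with a running agent so both commands can run non-interactively; the **.secret pattern and the key ID are made up:

    [encode]
    # working directory -> repository: store ciphertext
    **.secret = pipe: gpg --batch --yes --recipient ABCD1234 --encrypt --output -
    [decode]
    # repository -> working directory: restore plaintext
    **.secret = pipe: gpg --batch --quiet --decrypt --output -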
Please post a mail to the Mercurial mailing list with your experiences so that other users can benefit from them too.

Encryption for Folders

Is there a directory-encryption variant similar to VIM's "vim -x file"? I am looking for something like "mkdir -encrypt folder".
There is no "general" way to encrypt directories (i.e., one that works across all filesystems and operating systems; see below).
You can, however (as Dante mentioned) use TrueCrypt to create an encrypted filesystem in a file, then mount ("attach", in Windows terminology?) that file.
If you're using Linux, you can even mount that file at a particular directory, to make it appear that the directory is encrypted.
If you want to know how to use TrueCrypt, check out the docs for Windows here: http://www.truecrypt.org/docs/?s=tutorial and for Linux here: http://www.howtoforge.com/truecrypt_data_encryption (scroll down to the "TrueCrypt Download" heading).
So, a quick explanation of why you can encrypt files but not directories:
As far as the "computer" (that is, the hardware, operating system, filesystem drivers, etc.) is concerned, "files" are just "a bunch of bits on disk" (in the same way a book is "just a bunch of ink on paper"). When a program reads from or writes to a file, it can read or write whatever the heck it wants -- so if that program wants to encrypt some data before writing it to the file, or read a file then decrypt the data that it reads, great.
Directories are a different story, though: to read (i.e., list) or write (i.e., create) directories, the program (be it mkdir, ls, Windows Explorer or Finder) has to ask the operating system, and the operating system asks the filesystem driver: "Hey, can you make the directory /foo/bar?" or "Hey, can you tell me what's in /bar/baz?" -- all the program or operating system sees (basically) is a function to make directories and a function to list the contents of a directory.
So, to encrypt a directory, you can see that it would have to be the filesystem driver that is doing the encryption, not the program creating/listing the directories... And no modern filesystems support per-directory encryption.
On Linux, the simplest way is probably to use EncFS
"EncFS provides an encrypted filesystem in user-space. It runs without any special permissions and uses the FUSE library and Linux kernel module to provide the filesystem interface."
It basically mounts an encrypted folder as a plain one.
More info on Wikipedia.
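Typical usage looks like this (a sketch; the directory names are made up, and the first run interactively creates the EncFS configuration and asks for a password):

    encfs ~/.vault.encrypted ~/vault    # mount: ciphertext is stored in ~/.vault.encrypted, plaintext appears in ~/vault
    ls ~/vault                          # work with ~/vault like any ordinary folder
    fusermount -u ~/vault               # unmount when done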
TrueCrypt. It's open source and supports multiple types of encryption. What operating system do you wish to know about?
Edit: Windows Vista/XP, Mac OS X, and Linux are all supported.
I would recommend the Enterprise Cryptographic Filesystem (eCryptfs), found in apt-get as ecryptfs-utils on Debian/Ubuntu, because it is more flexible than TrueCrypt.
It is probably one of the strongest ways suggested here to encrypt a directory.
It can be used with two passwords: a login passphrase and a mount passphrase, making it a kind of double-password system.
It is also POSIX-compliant.
A limitation of this system, like many other encryption systems, is that it supports filename/directory-name lengths only up to 144 characters, in contrast to the Linux standard of 255.
It has been maintained for four years and last updated 4 months ago, so it looks good for the future.
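For example, on Debian/Ubuntu (a sketch; the directory name is made up, and the mount command interactively walks you through key type, cipher and other options):

    sudo apt-get install ecryptfs-utils
    mkdir -p ~/secret
    sudo mount -t ecryptfs ~/secret ~/secret    # overlay an encrypted view on the same directory
    sudo umount ~/secret                        # files on disk stay encrypted after unmounting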
A comparison between TrueCrypt and eCryptfs, from this blog post:
Truecrypt is simulated hardware encryption. It creates a virtual encrypted hard disk, which your operating system can more or less treat like an ordinary hard disk, but for the kernel hooks Truecrypt adds to lock and unlock the disk. EcryptFS is an encrypted filesystem. Unlike Truecrypt, which encrypts individual disk blocks, systems like EcryptFS encrypt and decrypt whole files.
and more comparison between the two systems here:
Those complications (and the fact that ecryptfs is slower) are part of why people like block-level encryption like TrueCrypt, but I do appreciate the flexibility of ecryptfs.

Which archiving utility should I use in Ubuntu?

I am a Mac/Ubuntu user. I have folders such as "AWK", "awk", "awk_tip" and "awk_notes". I need to archive them, but the variety of utilities confuses me. I had a look at tar, cpio and pax, but Git has started to fascinate me. I occasionally need encryption and backups.
Please, list the pros and cons of different archiving utilities.
Tar, cpio and pax are ancient Unix utilities. For instance, tar (which is probably the most common of these) was originally intended for making backups on tapes (hence the name, tar = tape archive).
The most commonly used archive formats today are:
tar (in Unix/Linux environments)
tar.gz or tgz (a gzip compressed tar file)
zip (in Windows environments)
If you want just one simple tool, take zip. It works right out of the box on most platforms, and it can be password protected (although the protection is technically weak).
If you need stronger protection (encryption), check out TrueCrypt. It is very good.
Under what OS / toolchain are you working? This might limit the range of existing solutions. Your name suggests Unix, but which one? Further, do you need portability or not?
The standard Linux solution (at least to a newbie like me) might be to tar and gzip or bzip2 the folders, then encrypt them with GnuPG if you really have to (encrypting awk tutorials seems a bit of overkill to me). You can also use full-fledged backup solutions like Bacula, or sync to a different location with rsync (perhaps to a backup server?).
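For example, using the folder names from the question (a sketch; the archive name is made up):

    tar czf awk-stuff.tar.gz AWK awk awk_tip awk_notes   # pack and gzip-compress
    gpg -c awk-stuff.tar.gz                              # symmetric encryption; prompts for a passphrase
    gpg -o awk-stuff.tar.gz -d awk-stuff.tar.gz.gpg      # decrypt it again later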
If you're backing up directories from an ext2/ext3 filesystem, you may want to consider using dump. Some nice features:
it can backup a directory or a whole partition
it saves permissions and timestamps,
it allows you to do incremental backups,
it can compress (gzip or bzip2)
it will automatically split the archive into multiple parts based on a size-limit if you want
it will backup over a network or to a tape as well as a file
It doesn't support encryption, but you can always encrypt the dump files afterwards.
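A sketch of how that might look (the paths are made up; a dump level above 0 is incremental relative to the most recent lower-level dump):

    sudo dump -0u -z -f /backup/home.0.dump /home    # full (level 0) backup, compressed, recorded in /etc/dumpdates
    sudo dump -1u -z -f /backup/home.1.dump /home    # later: level 1 incremental since the level 0
    gpg -c /backup/home.0.dump                       # encrypt afterwards if you need to
    sudo restore -i -f /backup/home.0.dump           # interactive restore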

Resources