I have a web server that has many compressed archive files (zip files) available for download. I would like to drastically reduce the disk footprint those archives take on the server.
The key insight is that those archives are in fact slightly different versions of the same uncompressed content. If you uncompressed any two of these many archives and ran a diff on the results, I expect you would find that the diff is about 1% of the total archive size.
Those archives are actually JAR files, but the compression details are, I believe, irrelevant. It does explain, though, why serving those archives in this specific compressed format is non-negotiable: it is the basic purpose of the server.
In itself, it is not a problem for me to install differential storage for the content of those archives, drastically reducing the disk footprint of the set of archives. There are numerous ways of doing this, using delta encoding or a compressed filesystem that understands sharing (e.g. I believe btrfs understands block sharing, or I could use snapshotting to enforce it).
The question is, how do I produce the compressed zips from those files? The server I have has very little computational power, certainly not enough to recreate JARs on the fly from the block-shared content.
Is there a programmatic way to expose the shared content at the uncompressed level to the compressed level? An easily-translatable-to-zip incremental compressed format?
Should I look for a caching solution coupled with generating JARs on the fly? This would at least alleviate the computational cost of generating the most frequently requested JARs.
There is specialized hardware that can produce zips very fast, but I'd rather avoid the expense. It's also not a very scalable solution as the number of requests to the server grows.
If the 1% differences are smeared across all of the entries in all of the jar files, then there's not much you can do without having to recompress a lot.
If, on the other hand, the 1% differences are concentrated in a few percent of the jar entries, with most entries unchanged, then there's hope. You can keep all of the individual jar entries in their own jar files on the server, and for each jar file you want to serve, just keep a list of the individual entry files to combine. It would be easy to write a fast utility that takes a set of jar files and merges them into a single jar file, if one doesn't exist already.
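A minimal sketch of such a merge utility, using Python's standard `zipfile` module; the `merge_jars` name and the (source jar, entry name) pair format are illustrative. Note that `read()`/`writestr()` decompresses and recompresses each entry, so a production version on a weak server would want to copy the raw deflate streams untouched instead:

```python
import zipfile

def merge_jars(entry_sources, out_path):
    """Build one jar from entries kept in separate per-entry jar files.

    entry_sources: list of (source_jar_path, entry_name) pairs, one per
    entry that should appear in the output jar.
    """
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as out:
        for jar_path, name in entry_sources:
            with zipfile.ZipFile(jar_path) as src:
                # read() decompresses and writestr() recompresses; a faster
                # variant would splice the raw compressed stream directly.
                out.writestr(name, src.read(name))
```

Since jar files are ordinary zip archives, the same code serves both cases; only the entry bookkeeping (which jars contribute which entries) is specific to your differential layout.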
One approach I've used in the past is to log the actual requests for the zip files for some time. If you find that the requests are highly skewed, then you may be able to use caching to alleviate the cost of producing zip files on the fly.
Basically, implement your differential storage along the lines you suggest. Also allocate some amount, say 10%, of your total storage for an LRU cache (or whatever other replacement algorithm you prefer) of the actual zipped files. Every time a user requests a zip, you serve it from the cache if it is there, or generate it on the fly and put it in the cache if not.
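The cache-or-generate step can be sketched like this. It's a toy in-process version: the `ZipCache` name and the caller-supplied `generate` callback are invented for illustration, and a real deployment would persist the index and guard against concurrent requests for the same file:

```python
import os
from collections import OrderedDict

class ZipCache:
    """Disk-budgeted LRU cache of generated zip files."""

    def __init__(self, cache_dir, budget_bytes):
        self.cache_dir = cache_dir
        self.budget = budget_bytes
        self.entries = OrderedDict()  # name -> size on disk, oldest first

    def get(self, name, generate):
        path = os.path.join(self.cache_dir, name)
        if name in self.entries:
            self.entries.move_to_end(name)  # mark as recently used
            return path
        generate(path)  # rebuild the zip from differential storage (caller-supplied)
        self.entries[name] = os.path.getsize(path)
        self._evict()
        return path

    def _evict(self):
        # Drop least-recently-used zips until we fit the storage budget.
        used = sum(self.entries.values())
        while used > self.budget and len(self.entries) > 1:
            old, size = self.entries.popitem(last=False)
            os.remove(os.path.join(self.cache_dir, old))
            used -= size
```

The budget check runs after insertion, so a single zip larger than the whole budget still gets served; it just evicts everything else.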
In the general case this may not work well, but in the common case that requests are concentrated on a small number of files, it may solve the problem.
Otherwise, I see your options as:
Use delta encoding on disk and then change the format your clients expect for responses. For example, instead of zip, you can serve them a format which is basically the bits of the delta-encoded files they need to reconstruct the file. On the server side, you save most of the work since you are just serving files more or less unmodified from disk, and then the client has to put them together (the existing client already has to unzip the files, so perhaps this is not an undue burden).
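To make that concrete, here is a toy copy/insert delta format of the kind the client could apply after download. The op encoding is invented for illustration; a real system would use an established format such as VCDIFF (xdelta) or bsdiff rather than rolling its own:

```python
def apply_delta(base, ops):
    """Rebuild a target file from a base file plus a list of delta ops.

    ops is a list of either ("copy", offset, length) -- reuse bytes from
    the base -- or ("insert", data) -- literal new bytes. The server only
    has to ship the ops; the shared bytes never leave the base the client
    already has.
    """
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += base[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)
```

Since your archives differ by about 1% uncompressed, the ops list would be dominated by a few large "copy" instructions, which is exactly where the bandwidth and CPU savings come from.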
Carefully look at the .zip format and store your files in a specialized way that does most of the .zip work ahead of time. For example, something like a delta encoding, but with the actual hard part of match-finding stored on disk, such that encoding a file can be a very fast process. This would require someone with sophisticated knowledge of the zip format to design, however.
I'm making an application which plays music from a remote server, and I would like to be able to sort by author/album/year/etc. AFAIK the only way to do this is by reading the metadata but I don't want to have to download the whole audio file just to read the metadata, is there any way to separate them and download only the metadata?
BTW. I am using webdav_client for flutter, which uses dio as a back-end so instructions for that specifically would be greatly appreciated. TY
Firstly, you can usually request a certain byte range by using ranged requests. This is dependent on server behavior: most servers support it, but some don't.
Next, you need to figure out the location of the ID3 tags you want. ID3v2 tags sit at the front of the file; ID3v1 is a fixed 128-byte block at the very end. Therefore, you could request the first 128 KB or so of the file and search it for ID3 data, while also reading the Content-Length response header. Then, if you don't find your tag at the beginning, you can request the last 128 bytes of the file and look for an ID3v1 tag there.
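A sketch of the back-of-file case in Python (the Flutter/dio call would be analogous: set a `Range: bytes=-128` request header). The fetch assumes the server honors ranged requests; the ID3v1 field offsets below are the standard fixed layout of that 128-byte trailer:

```python
import urllib.request

def fetch_last_bytes(url, n=128):
    # "bytes=-128" asks for the final 128 bytes only; servers that ignore
    # Range will return the whole file instead, so check the length.
    req = urllib.request.Request(url, headers={"Range": f"bytes=-{n}"})
    with urllib.request.urlopen(req) as resp:
        return resp.read()

def parse_id3v1(tag):
    """Parse a 128-byte ID3v1 trailer; returns None if there is no tag."""
    if len(tag) != 128 or not tag.startswith(b"TAG"):
        return None
    def text(b):
        # Fields are fixed-width, NUL-padded, latin-1 by convention.
        return b.split(b"\x00", 1)[0].decode("latin-1").strip()
    return {
        "title":  text(tag[3:33]),
        "artist": text(tag[33:63]),
        "album":  text(tag[63:93]),
        "year":   text(tag[93:97]),
    }
```

For ID3v2 at the front you would instead read the 10-byte header starting with `ID3`, which encodes the tag's total size, and then fetch exactly that many bytes.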
Most MP3 files aren't very big, and bandwidth is usually plentiful. Depending on the size and scope of this project, you might actually find it more efficient to just download the whole files.
I don't think it is possible to read just the ID3 metadata (at the beginning or the end) of the audio file without downloading the entire file first.
One idea would be to extract this information on the server side and provide it separately, in addition to the audio file itself. To do this, you would need one of the well-known extraction tools available for your platform. However, if you need to download hundreds or thousands of companion files, I am not sure about the reliability of such a system.
What's the best practice for storing a large (expanding) number of small files on a server without running into inode limitations?
For a project, I am storing a large number of small files on a server with 2 TB of HD space, but my limitation is the 2,560,000 allowed inodes. Recently the server used up all its inodes and was unable to write new files. I subsequently moved some files into databases, but others (images and JSON files) remain on the drive. I am currently at 58% inode usage, so a solution is required soon.
The reason for storing the files individually is to limit the number of database calls. Basically, the scripts check whether a file exists and, if so, return results accordingly. Performance-wise this makes sense for my application, but as stated above it has limitations.
As I understand it, it does not help to move the files into sub-directories, because each directory is itself a file with its own inode, so in fact I would just use up more inodes.
Alternatively I might be able to bundle the files together in an archive type of file, but that will require some sort of indexing.
Perhaps I am going about this all wrong, so any feedback is greatly appreciated.
On the advice of arkascha I looked into loop devices and found some documentation about losetup. Remains to be tested.
I am implementing a file sharing system in ASP.NET MVC3. I suppose most file sharing sites store files in a standard binary format on a server's file system, right?
I have two options storage wise - a file system, or binary data field in a database.
Are there any advantages to storing files (including large ones) in a database rather than on the file system?
MORE INFO:
Expected average file size is 800 MB, and around 3 files per minute are expected to be requested for download.
If the files are as big as that, then using the filesystem is almost certainly a better option. Databases are designed to contain relational data grouped into small rows and are optimized for consulting and comparing the values in these relations. Filesystems are optimized for storing fairly large blobs and recalling them by name as a bytestream.
Putting files that big into a database will also make it difficult to manage the space the database occupies. The filesystem tools for querying used space and for removing and replacing data are better.
The only caveat to using the filesystem is that your application has to run under an account that has the necessary permission to write the (portion of the) filesystem you use to store these files.
Use FILESTREAM when:
Objects that are being stored are, on average, larger than 1 MB.
Fast read access is important.
You are developing applications that use a middle tier for application logic.
Here is the MSDN link: https://msdn.microsoft.com/en-us/library/gg471497.aspx
How to use it: https://www.simple-talk.com/sql/learn-sql-server/an-introduction-to-sql-server-filestream/
I am looking for a good solution to prevent an exe file from being uploaded to the server.
It would be best if we could discard the upload by reading the file headers as soon as we receive them, rather than waiting for the entire file to upload.
I have already implemented an extension check and am looking for something more robust.
There is a how and a when/where part. The how is fairly simple, as binary files do contain a header, and the header is fairly easy to strip out and check. For Windows files, you can check the article Executable-File Header Format. Similar formats are used for other binary types, so you can determine which types you allow and which you do not.
NOTE: the linked article covers full querying of the file. There are cheap, down-and-dirty shortcuts where you only examine a few bytes.
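A down-and-dirty version of that few-bytes check, using well-known magic numbers. It is extension-independent, but it only covers the common formats listed, so treat it as one layer of defense rather than a complete filter:

```python
def looks_executable(first_bytes):
    """Return True if the leading bytes match a common executable format."""
    signatures = (
        b"MZ",                 # Windows PE / DOS executable
        b"\x7fELF",            # Linux/Unix ELF binary
        b"\xfe\xed\xfa\xce",   # Mach-O 32-bit (macOS)
        b"\xfe\xed\xfa\xcf",   # Mach-O 64-bit (macOS)
    )
    return any(first_bytes.startswith(sig) for sig in signatures)
```

Only the first four bytes of the upload are needed, which is what makes the abort-mid-stream approach described below practical.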
The when/where depends on how you are getting the files. If you are using a highly abstracted methodology (an upload library), which is fairly normal, you may have to stream the entire file before you can start querying the bits. Whether it is streamed into memory or you have to save and delete depends on your code and possibly even the library. If you control the upload stream, you can read just the first bytes (the header portion) and abort the process mid-stream.
The first point of access to uploaded data would be in a HttpModule.
Technically you can check whether you have an .exe on your hands before all the bytes are sent, and cancel the upload. It can get quite complicated, depending on how far you want to take this.
I suggest you look at the HttpModule of Brettle's NeatUpload. Maybe it gives you a lead on how to deal with this on the level you want.
I think you can do that with JavaScript by checking whether the file name ends with .exe before submitting the data, and also do the check server-side. The client-side check is only a convenience; it can be bypassed, so the server-side check is what actually protects you.
I have a great deal of data to keep synchronized over 4 or 5 sites around the world, around half a terabyte at each site. This changes (either adds or changes) by around 1.4 Gigabytes per day, and the data can change at any of the four sites.
A large percentage (30%) of the data is duplicate packages (perhaps packaged-up JDKs), so the solution would have to include a way of picking up the fact that such things are lying around on the local machine and grab them instead of downloading from another site.
The control of versioning is not an issue, this is not a codebase per-se.
I'm just interested if there are any solutions out there (preferably open-source) that get close to such a thing?
My baby script using rsync doesn't cut the mustard any more, I'd like to do more complex, intelligent synchronization.
Thanks
Edit : This should be UNIX based :)
Have you tried Unison?
I've had good results with it. It's basically a smarter rsync, which maybe is what you want. There is a listing comparing file syncing tools here.
Sounds like a job for BitTorrent.
For each new file at each site, create a bittorrent seed file and put it into centralized web-accessible dir.
Each site then downloads (via BitTorrent) all files. This will get you bandwidth sharing and automatic local copy reuse.
The actual recipe will depend on your needs.
For example, you can create 1 bittorrent seed for each file on each host, and set modification time of the seed file to be the same as the modification time of the file itself. Since you'll be doing it daily (hourly?) it's better to use something like "make" to (re-)create seed files only for new or updated files.
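The make-style rule can be sketched directly; the directory layout and `.torrent` suffix here are illustrative, and only the timestamp comparison matters:

```python
import os

def needs_reseed(data_path, seed_path):
    """Recreate the .torrent seed only if the data file is newer than it,
    mimicking make's out-of-date rule."""
    if not os.path.exists(seed_path):
        return True
    return os.path.getmtime(data_path) > os.path.getmtime(seed_path)

def stale_seeds(data_dir, seed_dir):
    """List data files whose seed files need (re-)creating."""
    out = []
    for name in sorted(os.listdir(data_dir)):
        seed = os.path.join(seed_dir, name + ".torrent")
        if needs_reseed(os.path.join(data_dir, name), seed):
            out.append(name)
    return out
```

Run on the daily (or hourly) schedule, only the files returned by `stale_seeds` get new torrents, so unchanged files cost nothing.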
Then you copy all seed files from all hosts to the centralized location ("tracker dir") with option "overwrite only if newer". This gets you a set of torrent seeds for all newest copies of all files.
Then each host downloads all seed files (again, with the "overwrite if newer" setting) and starts a BitTorrent download on all of them. This will download/redownload all the new/updated files.
Rinse and repeat, daily.
BTW, there will be no "downloading from itself", as you said in the comment. If a file is already present on the local host, its checksum will be verified and no downloading will take place.
How about something along the lines of Red Hat's Global Filesystem, so that the whole structure is split across every site onto multiple devices, rather than having it all replicated at each location?
Or perhaps a commercial network storage system such as from LeftHand Networks (disclaimer - I have no idea on cost, and haven't used them).
You have a lot of options:
You can try setting up a replicated database to store the data.
Use a combination of rsync or lftp and custom scripts, though that doesn't seem to suit you.
Use git repos with maximum compression and sync between them using some scripts.
Since the amount of data is rather large, and probably important, either do some custom development or hire an expert ;)
Check out Super Flexible... it's pretty cool. I haven't used it in a large-scale environment, but on a 3-node system it seemed to work perfectly.
Sounds like a job for Foldershare
Have you tried the detect-renamed patch for rsync (http://samba.anu.edu.au/ftp/rsync/dev/patches/detect-renamed.diff)? I haven't tried it myself, but I wonder whether it will detect not just renamed but also duplicated files. If it won't detect duplicated files, then, I guess, it might be possible to modify the patch to do so.